Introduction¶

In this project, I analyze a simulated cohort generated from a coalescent-based genetic model. Each row corresponds to one individual and includes both demographic covariates and genotype-derived predictors suitable for statistical learning.

The dataset used in this project was generated using a population-genetic simulation pipeline built with msprime, a coalescent simulator designed for scalable and biologically realistic genomic data. Instead of relying on a pre-existing real-world dataset, this approach creates a fully synthetic cohort in which the true effect sizes, covariates, and sources of noise are known. This allows direct evaluation of how well different statistical learning methods recover the underlying structure of the data.

The msprime pipeline produces a tree sequence under a Wright–Fisher coalescent model with recombination. From this tree sequence, a diploid genotype matrix is extracted and filtered by minor allele frequency. A subset of variants is designated as causal, each assigned an effect size drawn from a specified distribution. These effect sizes are used to construct a standardized polygenic score for each simulated individual. In addition to genetic predictors, the simulation pipeline generates demographic and environmental covariates—sex, age, and an environmental index—to mimic realistic non-genetic influences on phenotype.

A continuous quantitative trait (quant_trait) is then created from a linear model combining the polygenic score, covariates, and Gaussian noise. This produces a dataset with controlled polygenic signal, demographic structure, and stochastic variation. Optional principal components (PC1, PC2, …) derived from genotype PCA provide population-structure covariates when needed.
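The trait-generation step above can be sketched as follows; the coefficients and noise level here are hypothetical placeholders for illustration, not the pipeline's true generative parameters:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Illustrative covariates mirroring the cohort's columns; the effect sizes
# below are assumptions, not the values used by the actual pipeline.
sex = rng.integers(0, 2, size=n)
age = rng.uniform(20, 70, size=n)
env_index = rng.standard_normal(n)
polygenic_score = rng.standard_normal(n)

# Linear combination of polygenic score, covariates, and Gaussian noise,
# then standardized to mean 0 / sd 1, like quant_trait in the dataset.
raw = (0.7 * polygenic_score + 0.35 * env_index
       + 0.05 * sex + 0.001 * age
       + rng.normal(0, 0.6, size=n))
quant_trait = (raw - raw.mean()) / raw.std()
```

Because the final standardization rescales everything, only the relative sizes of the coefficients and the noise variance determine how much trait variation each predictor explains.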

Overall, each row of the dataset includes:

  • Demographic covariates: sex, age, env_index
  • Genetic predictors: polygenic_score, optional PCA components
  • Response variable: quant_trait

This design makes the dataset ideal for evaluating linear regression, subset selection, and shrinkage methods. Because the true generative parameters are known, the analysis can examine not only predictive performance (RMSE, R²) but also how closely the fitted models recover the true effect sizes used to generate the phenotype. A detailed description of the data-generation process is provided in the project’s msprime documentation.

The goals of this project are to use statistical learning methods to:

  • quantify how much of the variation in the quantitative trait is explained by the polygenic score and covariates,
  • evaluate whether controlling for population structure via principal components improves prediction,
  • compare classical linear models, subset selection, and shrinkage methods, and
  • assess how well the fitted models recover the true underlying effect sizes used during data generation.

Imports and General Setup¶

Import all necessary Python packages, load the data, and establish the variables used throughout the notebook.

In [266]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

sns.set(style="whitegrid", context="notebook")
In [267]:
cohort_path = "../data/3195663216_msprime_sim_cohort.csv"

cohort = pd.read_csv(cohort_path)
In [268]:
numeric_features = ["age", "env_index", "polygenic_score", "quant_trait"]
categorical_features = ["sex"]
pcs_features = [col for col in cohort.columns if col.startswith("PC")]
numeric_pcs_features = numeric_features + pcs_features

SEED = 42

EDA¶

Cohort¶

In [269]:
cohort.head()
Out[269]:
individual_id sex age env_index polygenic_score quant_trait disease_status disease_prob PC1 PC2
0 0 0 63.0 -1.629140 0.760770 -0.318469 1 0.387942 2.706997 10.632169
1 1 1 67.0 -0.624573 0.519406 0.683869 1 0.583642 8.517080 -11.365962
2 2 0 59.0 -0.243920 0.816273 1.795897 1 0.602745 2.731682 -4.373494
3 3 0 41.0 0.771197 -0.721764 -0.526473 0 0.402481 -2.823508 -3.385728
4 4 0 33.0 -1.024086 -0.769897 -0.169940 0 0.177698 2.707803 2.536408
In [270]:
cohort.shape
Out[270]:
(10000, 10)
In [271]:
cohort.dtypes
Out[271]:
individual_id        int64
sex                  int64
age                float64
env_index          float64
polygenic_score    float64
quant_trait        float64
disease_status       int64
disease_prob       float64
PC1                float64
PC2                float64
dtype: object
In [272]:
cohort.isna().sum()
Out[272]:
individual_id      0
sex                0
age                0
env_index          0
polygenic_score    0
quant_trait        0
disease_status     0
disease_prob       0
PC1                0
PC2                0
dtype: int64
In [274]:
cohort.info()
cohort.describe(include="all")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   individual_id    10000 non-null  int64  
 1   sex              10000 non-null  int64  
 2   age              10000 non-null  float64
 3   env_index        10000 non-null  float64
 4   polygenic_score  10000 non-null  float64
 5   quant_trait      10000 non-null  float64
 6   disease_status   10000 non-null  int64  
 7   disease_prob     10000 non-null  float64
 8   PC1              10000 non-null  float64
 9   PC2              10000 non-null  float64
dtypes: float64(7), int64(3)
memory usage: 781.4 KB
Out[274]:
individual_id sex age env_index polygenic_score quant_trait disease_status disease_prob PC1 PC2
count 10000.00000 10000.000000 10000.000000 10000.000000 1.000000e+04 1.000000e+04 10000.000000 10000.000000 1.000000e+04 1.000000e+04
mean 4999.50000 0.494400 45.034500 0.003659 -7.531753e-17 -1.989520e-17 0.495400 0.499984 -3.797140e-15 5.741185e-16
std 2886.89568 0.499994 14.637101 1.002948 1.000000e+00 1.000000e+00 0.500004 0.224913 8.915065e+00 8.576561e+00
min 0.00000 0.000000 20.000000 -3.486467 -3.425425e+00 -3.674438e+00 0.000000 0.019687 -2.241905e+01 -1.935404e+01
25% 2499.75000 0.000000 32.000000 -0.665262 -6.886466e-01 -6.673733e-01 0.000000 0.318043 -6.362309e+00 -5.800355e+00
50% 4999.50000 0.000000 45.000000 0.007784 1.069430e-04 -4.147091e-03 0.000000 0.500888 -1.984870e-01 -1.473405e+00
75% 7499.25000 1.000000 58.000000 0.671783 6.954927e-01 6.718384e-01 1.000000 0.679591 6.361650e+00 3.811644e+00
max 9999.00000 1.000000 70.000000 4.048420 3.433686e+00 3.676401e+00 1.000000 0.973832 2.766844e+01 4.090636e+01

Quick overview of the data: number of columns, column datatypes, missing values, and per-column summary statistics (count, mean, std, min, Q1, median, Q3, max).

Age¶

In [275]:
fig, axes = plt.subplots(1, 2, figsize=(14, 4))

sns.histplot(cohort["age"], kde=True, bins=30, ax=axes[0])
axes[0].set_title("Distribution of Age")

sns.boxplot(x=cohort["age"], ax=axes[1])
axes[1].set_title("Age Distribution")
axes[1].set_xlabel("Age")

plt.tight_layout()
plt.show()

cohort["age"].describe()
Out[275]:
count    10000.000000
mean        45.034500
std         14.637101
min         20.000000
25%         32.000000
50%         45.000000
75%         58.000000
max         70.000000
Name: age, dtype: float64

The Age Distribution analysis shows a total of 10,000 observations, ranging from 20 to 70 years old, with an average age of approximately 45.03 years. The median age is almost identical at 45.0 years, suggesting the distribution is relatively symmetrical, which is further supported by the box plot. The ages are well-spread, with the middle 50% of the data (the interquartile range) falling between 32 and 58 years. The histogram confirms a generally consistent representation of individuals across the entire age range, indicating no significant concentration in any narrow age bracket.

Env Index¶

In [276]:
plt.figure(figsize=(7,4))
sns.histplot(cohort["env_index"], kde=True, bins=30)
plt.title("Distribution of Env Index")
plt.show()

cohort["env_index"].describe()
Out[276]:
count    10000.000000
mean         0.003659
std          1.002948
min         -3.486467
25%         -0.665262
50%          0.007784
75%          0.671783
max          4.048420
Name: env_index, dtype: float64

The Environmental Index (env_index) analysis, based on 10,000 observations, reveals a distribution that is nearly perfectly Normal and Standardized. The index values range from a minimum of -3.48 to a maximum of 4.05. Both the mean (0.004) and the median (0.008) are centered almost exactly at zero, and the standard deviation is very close to one (1.003), indicating the variable has been scaled to the standard normal form. The histogram confirms the classic, symmetrical bell-shape of the distribution, with the highest concentration of scores clustering around the central mean value. This standardization is beneficial for modeling, as scores can be readily interpreted as standard deviations away from the average environmental index score.
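This visual impression of normality can also be checked numerically. The sketch below uses a synthetic stand-in sample; with the real cohort, the column would be passed directly, e.g. `stats.normaltest(cohort["env_index"])`:

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)
# Synthetic stand-in for cohort["env_index"] (a standard normal sample).
env_index = rng.standard_normal(10_000)

# D'Agostino-Pearson omnibus test: combines skewness and kurtosis checks.
stat, p = stats.normaltest(env_index)
print(f"statistic={stat:.3f}, p-value={p:.3f}")  # a large p-value gives no evidence against normality
```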

Polygenic Score¶

In [277]:
plt.figure(figsize=(7,4))
sns.histplot(cohort["polygenic_score"], kde=True, bins=30)
plt.title("Distribution of Polygenic Score")
plt.show()

cohort["polygenic_score"].describe()
Out[277]:
count    1.000000e+04
mean    -7.531753e-17
std      1.000000e+00
min     -3.425425e+00
25%     -6.886466e-01
50%      1.069430e-04
75%      6.954927e-01
max      3.433686e+00
Name: polygenic_score, dtype: float64

The Polygenic Score (polygenic_score) analysis, based on 10,000 data points, reveals a distribution that closely matches a standard normal distribution. The scores range from a minimum of -3.43 to a maximum of 3.43 and are centered with the mean ($-7.53 \times 10^{-17}$) and the median ($1.07 \times 10^{-4}$) effectively zero. Crucially, the standard deviation is exactly 1.00, confirming that this score has been precisely standardized. Visually, the histogram displays the classic symmetrical bell shape, with the highest count of individuals clustering around the zero mean. The middle 50% of the scores fall within a narrow, symmetrical range between approximately -0.69 and +0.70, which is typical for a standardized measure. In summary, the Polygenic Score is a standardized, normally distributed variable, well suited to subsequent statistical comparisons and modeling.

Quant Trait¶

In [278]:
plt.figure(figsize=(7,4))
sns.histplot(cohort["quant_trait"], kde=True, bins=30)
plt.title("Distribution of Quantitative Trait")
plt.show()

cohort["quant_trait"].describe()
Out[278]:
count    1.000000e+04
mean    -1.989520e-17
std      1.000000e+00
min     -3.674438e+00
25%     -6.673733e-01
50%     -4.147091e-03
75%      6.718384e-01
max      3.676401e+00
Name: quant_trait, dtype: float64

The Quantitative Trait (quant_trait) analysis, derived from 10,000 observations, shows a textbook example of a standard normal distribution. The trait scores span from a minimum of -3.67 to a maximum of 3.68. The distribution is precisely centered at zero, with the mean ($\approx -1.99 \times 10^{-17}$) and the median ($\approx -4.15 \times 10^{-3}$) being negligible. The standard deviation is exactly 1.00, confirming that this quantitative trait has been standardized to a mean of zero and unit variance. The histogram strongly reinforces the symmetrical, bell-shaped nature of the data, with scores congregating heavily around the central zero point. This standardization means the trait is well-prepared for any statistical modeling that assumes normality.

Disease Prob¶

In [279]:
plt.figure(figsize=(7,4))
sns.histplot(cohort["disease_prob"], kde=True, bins=30)
plt.title("Distribution of Disease Prob")
plt.show()

cohort["disease_prob"].describe()
Out[279]:
count    10000.000000
mean         0.499984
std          0.224913
min          0.019687
25%          0.318043
50%          0.500888
75%          0.679591
max          0.973832
Name: disease_prob, dtype: float64

The Disease Probability (disease_prob) analysis, based on 10,000 data points, displays a distinct bimodal (two-peaked) or uniform-like distribution that is symmetrical around the center. The probabilities span nearly the entire possible range, from a minimum of 0.019 to a maximum of 0.974. The central tendency is exactly what one would expect for a probability: the mean (0.500) and the median (0.501) are both right at 0.50. However, unlike a normal distribution, the histogram shows a high frequency of scores at both the low end (0.0 to 0.2) and the high end (0.8 to 1.0), with a slight dip in the middle. The relatively large standard deviation (0.225) reflects this wide spread. This pattern suggests that the population is roughly split: many individuals have either a very low probability of disease or a very high probability, with fewer individuals in the intermediate risk range.
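One plausible mechanism for this shape (an assumption about the pipeline, not something documented here) is a logistic link applied to a normally distributed risk score: when the underlying score has a standard deviation well above 1, the sigmoid pushes probability mass toward 0 and 1, producing exactly this dip in the middle:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical mechanism: a normally distributed linear risk score pushed
# through a logistic (sigmoid) link. A wide risk score (sd > 1) piles
# probabilities toward 0 and 1.
risk = rng.normal(0, 2.0, size=100_000)
prob = 1 / (1 + np.exp(-risk))

low  = np.mean(prob < 0.2)                    # share in the low tail
high = np.mean(prob > 0.8)                    # share in the high tail
mid  = np.mean((prob > 0.4) & (prob < 0.6))   # share near the centre
print(low, mid, high)  # both tails hold more mass than the equal-width centre band
```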

PC1¶

In [280]:
plt.figure(figsize=(7,4))
sns.histplot(cohort["PC1"], kde=True, bins=30)
plt.title("Distribution of PC1")
plt.show()

cohort["PC1"].describe()
Out[280]:
count    1.000000e+04
mean    -3.797140e-15
std      8.915065e+00
min     -2.241905e+01
25%     -6.362309e+00
50%     -1.984870e-01
75%      6.361650e+00
max      2.766844e+01
Name: PC1, dtype: float64

The PC1 (Principal Component 1) analysis, based on 10,000 observations, shows a clear and highly symmetrical distribution. PC1 scores range from a minimum of -22.42 to a maximum of 27.67. The distribution is centered at zero, with both the mean ($\approx -3.80 \times 10^{-15}$) and the median ($\approx -0.20$) being essentially zero. The standard deviation is approximately 8.92. The histogram displays a symmetrical bell shape, indicating that the majority of individuals have PC1 scores clustered around the zero mean, with frequency decreasing rapidly for extreme positive or negative scores. As the leading component from Principal Component Analysis (PCA), this centered distribution is expected and convenient for use as a population-structure covariate.

PC2¶

In [281]:
plt.figure(figsize=(7,4))
sns.histplot(cohort["PC2"], kde=True, bins=30)
plt.title("Distribution of PC2")
plt.show()

cohort["PC2"].describe()
Out[281]:
count    1.000000e+04
mean     5.741185e-16
std      8.576561e+00
min     -1.935404e+01
25%     -5.800355e+00
50%     -1.473405e+00
75%      3.811644e+00
max      4.090636e+01
Name: PC2, dtype: float64

The PC2 (Principal Component 2) analysis, derived from 10,000 observations, shows a distribution that is bimodal and right-skewed. The scores range from a minimum of -19.35 to a maximum of 40.91, indicating a wide spread reflected by a standard deviation of approximately 8.58. The key feature is the histogram's shape: there is a large, dominant peak around a score of 0 (or slightly negative), and a smaller, secondary peak around a score of 18. This indicates that the population is characterized by two distinct groups along this component. The mean is effectively zero ($\approx 5.74 \times 10^{-16}$), but the median is slightly negative at -1.47, reflecting the heavy concentration of scores on the left side of the axis before the long, thin tail extends out to the maximum value of $40.91$.

Sex¶

In [282]:
plt.figure(figsize=(7,4))
sns.countplot(x="sex", data=cohort)
plt.title("Count of Sex Categories")
plt.xlabel("Sex")
plt.ylabel("Count")
plt.show()

cohort["sex"].describe()
Out[282]:
count    10000.000000
mean         0.494400
std          0.499994
min          0.000000
25%          0.000000
50%          0.000000
75%          1.000000
max          1.000000
Name: sex, dtype: float64

The Sex Distribution analysis, a binary categorical variable based on 10,000 observations (represented by categories '0' and '1'), shows the sample is nearly perfectly balanced. The bar chart confirms an almost identical Count for both categories, which is reflected in the summary statistics: the mean of 0.4944 is extremely close to $0.50$, indicating a highly desirable 50/50 split (49.44% in category '1' and 50.56% in category '0'). This near-perfect balance is further reinforced by the high standard deviation of 0.499994, which approaches the maximum possible value for a binary variable. In conclusion, the balanced representation of the sex categories means the dataset is not biased by uneven sample sizes, ensuring robust comparisons in downstream statistical analyses.

Disease Status¶

In [283]:
plt.figure(figsize=(7,4))
sns.countplot(x="disease_status", data=cohort)
plt.title("Count of Disease Status Categories")
plt.xlabel("disease_status")
plt.ylabel("Count")
plt.show()

cohort["disease_status"].describe()
Out[283]:
count    10000.000000
mean         0.495400
std          0.500004
min          0.000000
25%          0.000000
50%          0.000000
75%          1.000000
max          1.000000
Name: disease_status, dtype: float64

The Disease Status (disease_status) analysis, a binary categorical variable based on 10,000 observations (represented by 0 for no disease and 1 for disease), indicates the sample is almost perfectly balanced between the two states. The bar chart visually confirms that the Count for category 0 (no disease) is nearly identical to the Count for category 1 (disease). This is strongly supported by the summary statistics: the mean is 0.4954, which is extremely close to $0.50$, meaning the sample is split almost 50/50 (49.54% of individuals have the disease, and 50.46% do not). The standard deviation is 0.500004, reflecting this high degree of balance. This near-even distribution is highly beneficial for statistical modeling, as it ensures that the analysis of factors related to disease status is not skewed by a minority class.

Correlations¶

In [284]:
cohort_numerical_vars = ["age", "env_index", "polygenic_score", "quant_trait", "disease_prob", "PC1", "PC2"]
sns.pairplot(cohort[cohort_numerical_vars], diag_kind="kde")
plt.suptitle("Pairwise Relationships", y=1.02)
plt.show()

cohort[cohort_numerical_vars].corr()
Out[284]:
age env_index polygenic_score quant_trait disease_prob PC1 PC2
age 1.000000 -0.000247 0.001887 0.014606 0.031899 9.694601e-05 -7.202352e-03
env_index -0.000247 1.000000 0.017408 0.353516 0.547487 1.806581e-02 -1.182278e-02
polygenic_score 0.001887 0.017408 1.000000 0.663065 0.815232 4.705014e-01 2.950384e-01
quant_trait 0.014606 0.353516 0.663065 1.000000 0.739369 3.138100e-01 2.010478e-01
disease_prob 0.031899 0.547487 0.815232 0.739369 1.000000 3.914610e-01 2.363886e-01
PC1 0.000097 0.018066 0.470501 0.313810 0.391461 1.000000e+00 5.784070e-16
PC2 -0.007202 -0.011823 0.295038 0.201048 0.236389 5.784070e-16 1.000000e+00

Strongest Relationships¶

The strongest correlations (closest to $\pm 1.0$) are found among the core quantitative and principal component variables:

  • Quantitative Trait and Polygenic Score show a strong positive correlation ($r \approx 0.66$).
  • Disease Probability and the Polygenic Score also show a strong positive correlation ($r \approx 0.81$).
  • Disease Probability is strongly correlated with the Quantitative Trait ($r \approx 0.74$).
  • Disease Probability is moderately related to the Environmental Index ($r \approx 0.55$).
  • PC1 shows a moderate correlation with the Polygenic Score ($r \approx 0.47$) and with Disease Probability ($r \approx 0.39$). This suggests that PC1 captures part of the same underlying variation as the genetic risk.

Weaker Relationships¶

  • Age shows only weak correlations with all other variables (all $|r|$ values are close to zero, the strongest being with Disease Probability at $r \approx 0.032$), suggesting that, linearly, age has little impact on the genetic, environmental, or trait scores.
  • PC2 has weak relationships with most variables, its strongest correlation being with the Polygenic Score ($r \approx 0.30$), followed by Disease Probability ($r \approx 0.24$) and the Quantitative Trait ($r \approx 0.20$).

Conclusion¶

In summary, the correlation matrix shows that the Polygenic Score, Quantitative Trait, and Environmental Index are all strongly associated with Disease Probability, suggesting they measure similar underlying risk factors, while PC1 is moderately associated with the genetic predictors. Age stands out as largely independent of these factors, and PC2 captures some additional variation tied mainly to the Polygenic Score. This matrix is essential for informing subsequent modeling decisions, as highly correlated variables (like disease_prob and polygenic_score) may cause multicollinearity if used together in a linear regression model.

In [285]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Note: these VIFs are computed on the raw columns without an intercept;
# for non-centered variables (e.g. age), this alone can inflate the VIF.
X = cohort[cohort_numerical_vars]
vif = pd.DataFrame()
vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["Feature"] = X.columns
print(vif)
         VIF          Feature
0  10.339426              age
1   1.845477        env_index
2   3.696521  polygenic_score
3   2.256771      quant_trait
4  12.326898     disease_prob
5   1.320267              PC1
6   1.126795              PC2

The VIF table indicates how much the variance of the estimated regression coefficients is inflated due to collinearity. A common rule of thumb is that a VIF greater than 5 or 10 suggests a problematic level of multicollinearity that can lead to unstable and unreliable model estimates.

High Multicollinearity (VIF > 5)¶

Two features in your dataset show VIF values that are significantly high and likely problematic:

  • disease_prob (VIF = 12.33): This is the highest VIF, indicating a severe issue. This feature is highly correlated with several other predictors in the model, most notably polygenic_score ($r \approx 0.82$) and quant_trait ($r \approx 0.74$), along with env_index ($r \approx 0.55$), as seen in the Correlation Matrix.
  • age (VIF = 10.34): This is also very high. The correlation matrix showed only weak bivariate correlations for age, however, so this value is likely an artifact of computing VIFs without an intercept column: age is the only non-centered predictor (mean $\approx 45$), and uncentered variables can show inflated VIFs when no constant term is included.

Moderate Multicollinearity (VIF 3 to 5)¶

  • polygenic_score (VIF = 3.70): This value is moderate and might warrant attention. It is higher than the typical threshold of 2.5, which is sometimes used as a cautionary level.

Low Multicollinearity (VIF < 3)¶

The remaining features exhibit low VIF scores, suggesting they do not suffer significantly from multicollinearity and can be included in a model with relative stability:

  • quant_trait (VIF = 2.26)
  • env_index (VIF = 1.85)
  • PC1 (VIF = 1.32)
  • PC2 (VIF = 1.13)

Conclusion for Modeling¶

In summary, the VIF analysis flags multicollinearity concerns for disease_prob and age. The coefficient for disease_prob will likely be unstable and difficult to interpret, so for a robust regression model one should consider excluding it (since it is a highly correlated composite); the high VIF for age should be re-checked with an intercept included before deciding to exclude it. Alternatively, shrinkage techniques such as Ridge Regression can mitigate these effects. Since PC1 and PC2 have very low VIF scores, the PCA approach appears to have successfully created orthogonal (uncorrelated) features that are well suited to regression.
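As a sketch of why ridge helps with collinear predictors (this assumes scikit-learn is available, which is not imported elsewhere in the notebook; the data are synthetic stand-ins, not the cohort columns):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 2000
# Two highly collinear predictors (r close to 1), loosely analogous to
# disease_prob and polygenic_score in the cohort.
x1 = rng.standard_normal(n)
x2 = x1 + rng.normal(0, 0.05, n)
y = 1.0 * x1 + 1.0 * x2 + rng.normal(0, 1, n)
X = np.column_stack([x1, x2])

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

# Ridge shrinks the unstable difference between the two collinear
# coefficients while leaving their well-identified sum nearly intact.
print("OLS:  ", ols.coef_)
print("Ridge:", ridge.coef_)
```

The individual OLS coefficients are poorly identified (only their sum is), which is exactly the instability VIF warns about; the ridge penalty trades a little bias for much lower coefficient variance.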

In [286]:
cohort_categorical_vars = ["sex", "disease_status"]
n_cohort_categorical_vars = len(cohort_categorical_vars)

fig, axes = plt.subplots(n_cohort_categorical_vars, n_cohort_categorical_vars, figsize=(n_cohort_categorical_vars * 3.5, n_cohort_categorical_vars * 3.5))
plt.suptitle("Pairwise Categorical Relationships", y=1.01, fontsize=16)

for i in range(n_cohort_categorical_vars):
    for j in range(n_cohort_categorical_vars):
        var1 = cohort_categorical_vars[i]
        var2 = cohort_categorical_vars[j]

        if i == j:
            sns.countplot(
                y=cohort[var1],
                ax=axes[i, j],
                hue=cohort[var1],
                palette="Pastel1",
                order=cohort[var1].value_counts().index,
                legend=False
            )
            axes[i, j].set_title(f"Distribution of {var1}", fontsize=12)
            axes[i, j].set_ylabel("")
            axes[i, j].set_xlabel("Count")

        else:
            contingency_table = pd.crosstab(cohort[var1], cohort[var2], normalize='index')
            sns.heatmap(
                contingency_table,
                annot=True,
                fmt=".2f",
                cmap="YlGnBu",
                cbar=False,
                ax=axes[i, j],
                linewidths=.5,
                linecolor='gray'
            )
            axes[i, j].set_title(f"{var1} vs {var2} (Row Normalized)", fontsize=12)
            axes[i, j].set_ylabel(var1)
            axes[i, j].set_xlabel(var2)

plt.tight_layout(rect=[0, 0, 1, 0.98])
plt.show()

The relationship is relatively weak: while Sex 0 is slightly more likely to have a Disease Status of 0 (No Disease, $54\%$), Sex 1 is slightly more likely to have a Disease Status of 1 (Disease, $53\%$). However, the differences are minor (only $3\%$ to $4\%$ deviation from a perfect 50/50 split within each sex group). This suggests that sex is not a strong determining factor for disease status within this population.
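The visual impression can be formalized with a chi-square test of independence. The sketch below uses synthetic stand-in columns; with the real cohort, pass `pd.crosstab(cohort["sex"], cohort["disease_status"])` to the same function:

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

rng = np.random.default_rng(7)
# Synthetic stand-ins for the two binary columns, independent by construction.
sex = pd.Series(rng.integers(0, 2, 10_000), name="sex")
disease = pd.Series(rng.integers(0, 2, 10_000), name="disease_status")

table = pd.crosstab(sex, disease)  # 2x2 contingency table of counts
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.3f}, dof={dof}")  # small p would indicate dependence
```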

Variables vs Trait¶

In [287]:
plt.figure(figsize=(6,4))
sns.violinplot(x=cohort["sex"], y=cohort["quant_trait"], color="red")
plt.title("Trait Distribution by Sex")
plt.show()

This visualization suggests there is no meaningful difference in the distribution or the average of the Quantitative Trait between the two sex categories: the trait appears to be distributed identically regardless of sex. A plot alone cannot establish statistical significance, but the visual evidence points to no sex effect.
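A Welch two-sample t-test would make this comparison formal. As a sketch with synthetic stand-ins (with the real cohort, compare `cohort.loc[cohort["sex"] == 0, "quant_trait"]` against the sex-1 group):

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(3)
# Synthetic stand-ins for quant_trait split by sex, drawn from the
# same distribution so the null hypothesis holds by construction.
trait_sex0 = rng.standard_normal(5000)
trait_sex1 = rng.standard_normal(5000)

# Welch's t-test (does not assume equal group variances).
t, p = stats.ttest_ind(trait_sex0, trait_sex1, equal_var=False)
print(f"t={t:.3f}, p={p:.3f}")  # a large p is consistent with equal group means
```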

In [288]:
plt.figure(figsize=(6,4))
sns.scatterplot(x=cohort["age"], y=cohort["quant_trait"], alpha=0.4, color="orange")
plt.title("Quantitative Trait vs Age")
plt.show()

The scatter plot of Quantitative Trait vs Age demonstrates a near-total absence of a linear relationship between the two variables. The data points form a dense, uniform cloud spanning the entire age range (20 to 70) and the entire trait score range (approximately -4 to +4). The lack of any discernible trend, slope, or curvature visually confirms that an individual's age does not serve as a linear predictor for their quantitative trait score. This finding is consistent with the correlation matrix, which reported a negligible correlation coefficient ($r \approx 0.014$) between Age and the Quantitative Trait.

In [289]:
plt.figure(figsize=(6,4))
sns.scatterplot(x=cohort["env_index"], y=cohort["quant_trait"], alpha=0.4, color="green")
plt.title("Quantitative Trait vs Environmental Index")
plt.show()

The scatter plot displays a distinct positive linear relationship between the Quantitative Trait and the Environmental Index. The cloud of points is elongated and generally slopes upward from the bottom-left to the top-right of the graph. This visual trend indicates that as the Environmental Index score increases, the Quantitative Trait score tends to increase as well. This finding is confirmed by the Correlation Matrix, which reported a moderately strong positive correlation coefficient ($r \approx 0.354$) between these two variables. While the relationship is clear, the points are still quite spread out, suggesting that the Environmental Index alone does not perfectly predict the Quantitative Trait, but it does contribute significantly to its variation.

In [290]:
plt.figure(figsize=(6,4))
sns.scatterplot(x=cohort["polygenic_score"], y=cohort["quant_trait"], alpha=0.4)
plt.title("Quantitative Trait vs Polygenic Score")
plt.show()

The scatter plot displays a strong positive linear relationship between the Quantitative Trait and the Polygenic Score. The dense cloud of points is clearly elongated and angled sharply upward from the bottom-left to the top-right of the graph, forming a tight ellipse. This visual pattern indicates a direct association: individuals with a higher Polygenic Score generally have a higher Quantitative Trait score, and vice versa. This strong linear dependency is reflected in the Correlation Matrix, which reported a substantial positive correlation coefficient ($r \approx 0.66$) between these two variables. This finding suggests that genetic factors, as summarized by the Polygenic Score, are a major contributor to the variation observed in the Quantitative Trait.

In [291]:
plt.figure(figsize=(6,4))
sns.scatterplot(x=cohort["disease_prob"], y=cohort["quant_trait"], alpha=0.4, color="blue")
plt.title("Quantitative Trait vs Disease Prob")
plt.show()

The scatter plot displays a strong non-linear relationship between the Quantitative Trait and the Disease Probability. The data points show a clear and continuous positive trend, indicating that as an individual's Quantitative Trait score increases, their Disease Probability also increases. This relationship, however, is not perfectly linear; the points are more densely packed and the curve appears to steepen slightly at the extreme ends of the probability scale (closer to 0.0 and 1.0).
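If this curvature reflects a logistic link in the generative pipeline (an assumption, not documented here), then the logit transform $\log(p/(1-p))$ of the probability should be more linearly related to the trait than the raw probability. A sketch on synthetic data built under that assumption:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 50_000
# Hypothetical mechanism: disease_prob = sigmoid(linear score), with the
# trait correlated with the same underlying score.
score = rng.standard_normal(n)
trait = 0.8 * score + rng.normal(0, 0.6, n)
prob = 1 / (1 + np.exp(-2.0 * score))  # steep sigmoid compresses the tails

logit = np.log(prob / (1 - prob))      # recovers the linear score exactly
r_raw = np.corrcoef(prob, trait)[0, 1]
r_logit = np.corrcoef(logit, trait)[0, 1]
print(r_raw, r_logit)  # the logit scale is more linearly related to the trait
```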

Correlation Matrix¶

In [292]:
plt.figure(figsize=(10,8))
# drop the ID column, which carries no real signal
sns.heatmap(cohort.drop(columns="individual_id").corr(numeric_only=True), annot=False, cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()
  1. Variables and Distributions The dataset is well-balanced across its two categorical variables: Sex and Disease Status are both split approximately 50/50, which is ideal for modeling and comparison.

    The quantitative variables show four distinct distributional patterns:

    • Standard Normal Distributions: The Environmental Index, Polygenic Score, and Quantitative Trait all exhibit nearly perfect symmetrical, bell-shaped distributions, centered at zero (mean $\approx 0$) with a standard deviation ($\sigma$) of $1.00$. PC1 is likewise symmetric and centered at zero, but on a larger scale ($\sigma \approx 8.9$). This indicates that the standardized scores follow classic statistical assumptions.
    • Symmetrical, Bimodal Distribution: Disease Probability is centered at $0.50$ but is bimodal or uniform-like, with a higher frequency of individuals having either very low or very high probabilities, and fewer in the middle range.
    • Bimodal and Skewed Distribution: PC2 is bimodal and slightly right-skewed, suggesting it captures variation related to two distinct subgroups within the population.
    • Uniform Distribution: Age is broadly distributed between 20 and 70, with no heavy concentration in any single age bracket.
  2. Key Relationships and Dependencies The Correlation Matrix identifies strong interdependencies among the risk and trait variables:

    • Strongest Associations: Disease Probability is strongly positively correlated with the Polygenic Score ($r \approx 0.82$) and the Quantitative Trait ($r \approx 0.74$), and moderately with the Environmental Index ($r \approx 0.55$) and PC1 ($r \approx 0.39$). This indicates that the probability of disease is influenced by both genetic and environmental factors.
    • Trait Relationships: The Quantitative Trait is strongly correlated with the Polygenic Score ($r \approx 0.66$) and moderately correlated with the Environmental Index ($r \approx 0.35$). It also has a strong non-linear relationship with Disease Probability ($r \approx 0.74$).
    • Independent Variables: Age is confirmed to have a negligible linear relationship with every other variable in the dataset ($r$ values are near zero), including the Quantitative Trait ($r \approx 0.014$). Sex is also independent of Disease Status, with only a minor deviation from a 50/50 split within each sex category. Furthermore, the Quantitative Trait distribution is essentially identical across both sex categories, indicating no sex-based effect on the trait score.
  3. Model Diagnostics and Recommendations. The Variance Inflation Factor (VIF) analysis highlights significant challenges for direct regression modeling:

    • Multicollinearity Concerns: Both disease_prob (VIF = 12.33) and age (VIF = 10.34) exceed the common threshold of 10, indicating severe multicollinearity. The high VIF for disease_prob is expected, given its strong correlation with three other predictors (Env Index, Polygenic Score, and PC1). The high VIF for age, despite its low bivariate correlations, suggests that its variance is largely explained by a linear combination of the other predictors taken together.
    • Robust Features: The Principal Components (PC1 VIF = 1.32 and PC2 VIF = 1.13) successfully created orthogonal features, demonstrating that using PCA was an effective strategy for managing the high correlations among the underlying risk factors.
    • Recommendation: For stable linear regression, the highly collinear variables disease_prob and age should generally be excluded from the model, or the model should rely on the uncorrelated Principal Components instead of the original raw scores.

In conclusion, the dataset is robustly balanced and features strong, understandable linear and non-linear associations among the risk and trait factors, but requires careful feature selection to address high multicollinearity before reliable regression modeling can be performed.
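The VIF figures quoted above can be reproduced with a few lines of NumPy. The `vif` helper and toy predictors below are an illustrative sketch rather than the notebook's actual pipeline; one deliberately collinear column shows how the statistic behaves.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X:
    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j on all the other columns (plus an intercept)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        r2 = 1 - ((y - A @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Toy predictors: x3 is nearly a copy of x1, so both get large VIFs,
# while the independent x2 stays close to 1.
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=500), rng.normal(size=500)
x3 = 0.95 * x1 + 0.05 * rng.normal(size=500)
print(vif(np.column_stack([x1, x2, x3])).round(2))
```

Any variable whose VIF blows up this way is a candidate for exclusion, exactly as recommended for disease_prob above.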

PCA Scatterplot¶

In [293]:
plt.figure(figsize=(6,5))

ax = plt.gca()

scatter = sns.scatterplot(
    x="PC1",
    y="PC2",
    data=cohort,
    hue="quant_trait",
    palette="viridis",
    alpha=0.7,
    legend=False,
    ax=ax
)

# Create normalization and ScalarMappable
norm = plt.Normalize(cohort["quant_trait"].min(), cohort["quant_trait"].max())
sm = plt.cm.ScalarMappable(cmap="viridis", norm=norm)
sm.set_array([])

# Add colorbar **and tell it to use this Axes**
cbar = plt.colorbar(sm, ax=ax)
cbar.set_label("Quantitative Trait")

# Labels and title
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_title("PC1 vs PC2 Colored by Quantitative Trait")

plt.tight_layout()
plt.show()
No description has been provided for this image

The plot demonstrates that the first principal component, PC1, serves as an excellent, almost linear, summary measure for the Quantitative Trait. Individuals with high PC1 scores also have high trait scores, validating PCA's effectiveness in dimensionality reduction for this dataset.

In [294]:
plt.figure(figsize=(6,5))
sns.scatterplot(
    x="PC1",
    y="PC2",
    data=cohort,
    hue="sex",
    palette="Set2",
    alpha=0.7
)
plt.title("PC1 vs PC2 Colored by Sex")
plt.show()
No description has been provided for this image

The scatter plot shows PC1 vs PC2 colored by Sex, demonstrating a uniform and complete mixing of the two sex categories across the entire Principal Component Analysis (PCA) space. This visually confirms that Sex is not a driving factor in the major patterns of variation captured by either PC1 or PC2.

In [295]:
from matplotlib.colors import Normalize

fig, ax = plt.subplots(figsize=(6, 5))

sns.scatterplot(
    x="PC1",
    y="PC2",
    data=cohort,
    hue="env_index",
    palette="viridis",
    alpha=0.7,
    legend=False,
    ax=ax
)

# Add colorbar manually
norm = Normalize(cohort["env_index"].min(), cohort["env_index"].max())
sm = plt.cm.ScalarMappable(cmap="viridis", norm=norm)
sm.set_array([])

fig.colorbar(sm, ax=ax, label="Env Index")

ax.set_title("PC1 vs PC2 Colored by Environmental Index")
plt.show()
No description has been provided for this image

The plot visually validates that the first principal component, PC1, serves as an excellent, almost linear, summary measure for the Environmental Index, alongside the Quantitative Trait and Disease Probability. This supports using PC1 as a consolidated variable representing the composite risk factors in subsequent modeling.

Outlier Detection¶

In [296]:
cohort_n_numerical_cols = len(cohort_numerical_vars)

fig, axes = plt.subplots(1, cohort_n_numerical_cols, figsize=(4 * cohort_n_numerical_cols, 10))
plt.suptitle("Outlier and Distribution Check (Box Plots)", fontsize=16, y=1.02)

for i, col in enumerate(cohort_numerical_vars):
    sns.boxplot(y=cohort[col], ax=axes[i], color='skyblue')
    axes[i].set_title(col, fontsize=12)
    axes[i].set_ylabel("")

plt.tight_layout(rect=[0, 0, 1, 0.98])
plt.show()
No description has been provided for this image

The box plots show that several of the standardized variables contain numerous observations outside the whisker range, although this is expected given the approximately normal distribution of most scores:

  • Age and Disease Probability: These variables show no significant outliers. For Age, the data is spread evenly across the $20$ to $78$ year range. For Disease Probability, the wide box and whiskers extending close to $0$ and $1$ confirm the distribution is broad but bounded.
  • Environmental Index, Polygenic Score, and Quantitative Trait: All three standardized variables display numerous symmetrical outliers at both the positive and negative extremes ($\pm 3$ to $\pm 4$). This is generally expected for large, normally distributed datasets, as these extreme scores represent individuals at the very tail ends of the risk or trait distribution. These outliers should not be removed as they are statistically meaningful points of variation.
  • Principal Components (PC1 and PC2):
    • PC1 shows only a few outliers on the positive extreme ($\approx 27.7$), but is otherwise symmetrical.
    • PC2 shows a higher concentration of outliers on the positive extreme (up to $\approx 40.9$), with a less symmetrical box. This confirms the slight right-skewness observed in its histogram and indicates that a small number of individuals possess unusually high scores along this secondary component.

Z-Scores¶

In [297]:
from scipy.stats import zscore

# Calculate z-scores
cohort["age_z"] = zscore(cohort["age"])
cohort["env_index_z"] = zscore(cohort["env_index"])
cohort["trait_z"] = zscore(cohort["quant_trait"])
cohort["pgs_z"] = zscore(cohort["polygenic_score"])
cohort["disease_prob_z"] = zscore(cohort["disease_prob"])
cohort["PC1_z"] = zscore(cohort["PC1"])
cohort["PC2_z"] = zscore(cohort["PC2"])

# Identify rows with extreme |z| > 3
age_outliers = cohort[cohort["age_z"].abs() > 3]
env_index_outliers = cohort[cohort["env_index_z"].abs() > 3]
trait_outliers = cohort[cohort["trait_z"].abs() > 3]
pgs_outliers = cohort[cohort["pgs_z"].abs() > 3]
disease_prob_outliers = cohort[cohort["disease_prob_z"].abs() > 3]
PC1_outliers = cohort[cohort["PC1_z"].abs() > 3]
PC2_outliers = cohort[cohort["PC2_z"].abs() > 3]

print(f"Count of Age Outliers: {len(age_outliers)}")
print(f"Count of Env Index Outliers: {len(env_index_outliers)}")
print(f"Count of Trait Outliers: {len(trait_outliers)}")
print(f"Count of Polygenic Score Outliers: {len(pgs_outliers)}")
print(f"Count of Disease Prob Outliers: {len(disease_prob_outliers)}")
print(f"Count of PC1 Outliers: {len(PC1_outliers)}")
print(f"Count of PC2 Outliers: {len(PC2_outliers)}")
Count of Age Outliers: 0
Count of Env Index Outliers: 24
Count of Trait Outliers: 28
Count of Polygenic Score Outliers: 9
Count of Disease Prob Outliers: 0
Count of PC1 Outliers: 2
Count of PC2 Outliers: 79

The presence and distribution of these outliers have important implications for subsequent statistical modeling, particularly linear regression:

  • PC2: The Primary Concern (79 Outliers): PC2 has the highest number of extreme outliers (79), confirming the strong right-skewness and extreme values visible in its box plot. These outliers have the potential to significantly skew the mean and inflate the standard deviation of PC2, which can lead to unstable and biased regression coefficients if PC2 is used as a predictor in a model.
  • Env Index, Quantitative Trait, and Polygenic Score (Moderate Outliers): These standardized variables have a moderate number of outliers (24, 28, and 9, respectively). While these are expected in large normal distributions, they can still increase the variance of the residual errors in a model. However, since the box plots showed these outliers are relatively symmetrical (present on both the low and high ends), the impact on the mean and bias of coefficients is likely less severe than with the skewed PC2.
  • Age, Disease Prob, and PC1 (Minimal Outliers): The absence of Z-score outliers for Age and Disease Probability confirms their distributions are well-contained. The near-absence of outliers in PC1 (2 outliers) is excellent, suggesting it is a highly stable and robust feature for use in regression, as extreme values will not distort its predictive power.
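As a sanity check on these counts, the number of $|z| > 3$ observations expected under exact normality follows directly from the standard normal tail. A short scipy sketch, using the cohort size of $10{,}000$ (the $7{,}000/3{,}000$ split below):

```python
from scipy.stats import norm

# Under a standard normal, P(|z| > 3) ~ 0.27%, so a cohort of
# n = 10,000 should contain roughly 27 such observations.
n = 10_000
p_tail = 2 * norm.sf(3)  # two-sided tail probability beyond |z| = 3
print(f"P(|z| > 3) = {p_tail:.4%}; expected count at n={n}: {n * p_tail:.1f}")
```

The Env Index (24) and Quantitative Trait (28) counts sit right at this expectation, while PC2's 79 outliers are roughly triple it, consistent with its skewed distribution.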

Setup Disease Status Classification¶

Model Preparation¶

In [298]:
from sklearn.model_selection import train_test_split

X = cohort[numeric_features + pcs_features + categorical_features]
y = cohort["disease_status"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=SEED
)

X_train.shape, X_test.shape
Out[298]:
((7000, 7), (3000, 7))

Standardization¶

In [299]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train_scaled_num = scaler.fit_transform(X_train[numeric_pcs_features])
X_train_scaled = pd.DataFrame(
    X_train_scaled_num,
    columns=numeric_pcs_features,
    index=X_train.index
)

X_test_scaled_num = scaler.transform(X_test[numeric_pcs_features])
X_test_scaled = pd.DataFrame(
    X_test_scaled_num,
    columns=numeric_pcs_features,
    index=X_test.index
)

X_train_scaled["sex"] = X_train["sex"].values
X_test_scaled["sex"] = X_test["sex"].values

X_train_scaled.head()
Out[299]:
age env_index polygenic_score quant_trait PC1 PC2 sex
9069 -0.951245 -0.385407 -1.745874 -1.658782 -1.666482 -1.081947 0
2603 0.416455 0.299589 -0.116855 -0.836093 -1.699625 -0.606496 0
7738 0.553225 -0.651264 0.978902 -0.308846 2.083987 -0.817477 0
1579 1.168690 -1.364159 -0.024322 -0.576059 -0.369216 0.407394 1
5058 -0.130625 0.865069 -0.171785 0.621501 -0.218088 0.147418 0

Split Data Overview¶

In [300]:
core_vars = numeric_pcs_features + categorical_features

X_train_full = X_train_scaled[core_vars]
X_test_full = X_test_scaled[core_vars]

X_train_full.head()
Out[300]:
age env_index polygenic_score quant_trait PC1 PC2 sex
9069 -0.951245 -0.385407 -1.745874 -1.658782 -1.666482 -1.081947 0
2603 0.416455 0.299589 -0.116855 -0.836093 -1.699625 -0.606496 0
7738 0.553225 -0.651264 0.978902 -0.308846 2.083987 -0.817477 0
1579 1.168690 -1.364159 -0.024322 -0.576059 -0.369216 0.407394 1
5058 -0.130625 0.865069 -0.171785 0.621501 -0.218088 0.147418 0
In [301]:
X_train_scaled[numeric_features].mean(), X_train_scaled[numeric_features].std()
Out[301]:
(age               -8.932537e-17
 env_index         -3.045183e-18
 polygenic_score   -1.827110e-17
 quant_trait       -1.395709e-17
 dtype: float64,
 age                1.000071
 env_index          1.000071
 polygenic_score    1.000071
 quant_trait        1.000071
 dtype: float64)
In [302]:
X_test_scaled[numeric_features].mean(), X_test_scaled[numeric_features].std()
Out[302]:
(age                0.028347
 env_index         -0.000676
 polygenic_score    0.033285
 quant_trait        0.002218
 dtype: float64,
 age                1.002912
 env_index          1.029294
 polygenic_score    0.988079
 quant_trait        0.970384
 dtype: float64)

Just a sanity check before running models: the training and test sets have the same columns, and the test-set features remain approximately standardized (means near 0, standard deviations near 1) even though the scaler was fit on the training data only.
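As an aside, the fit-on-train / transform-on-test bookkeeping above can be wrapped in a scikit-learn Pipeline so the scaler is refit automatically inside any cross-validation split. A minimal sketch with a made-up mini-cohort (the column names are assumed to mirror the real one):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical mini-cohort standing in for the real DataFrame.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.uniform(20, 78, 300),
    "polygenic_score": rng.normal(size=300),
    "sex": rng.integers(0, 2, 300),
})
y = rng.integers(0, 2, 300)

# Scale only the numeric columns; pass the binary sex column through.
pre = ColumnTransformer(
    [("num", StandardScaler(), ["age", "polygenic_score"])],
    remainder="passthrough",
)
clf = Pipeline([("pre", pre), ("lr", LogisticRegression(max_iter=1000))])
clf.fit(df, y)
print(f"training accuracy: {clf.score(df, y):.3f}")
```

Because the scaler lives inside the Pipeline, leakage from test to train is ruled out by construction.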

Classification Models¶

Functions that will be used constantly for model evaluation

In [303]:
from sklearn.metrics import accuracy_score, roc_auc_score, precision_recall_curve, roc_curve, average_precision_score

def evaluate_and_graph_clf(model, X_train, y_train, X_test, y_test, name, graph):
    model.fit(X_train, y_train)
    y_pred_test = model.predict(X_test)
    y_pred_train = model.predict(X_train)

    if hasattr(model, "predict_proba"):
        y_prob_test = model.predict_proba(X_test)[:, 1]
        y_prob_train = model.predict_proba(X_train)[:, 1]
    else:
        y_prob_test = model.decision_function(X_test)
        y_prob_train = model.decision_function(X_train)

    test_acc = accuracy_score(y_test, y_pred_test)
    test_auc = roc_auc_score(y_test, y_prob_test)
    test_ap = average_precision_score(y_test, y_prob_test)

    train_acc = accuracy_score(y_train, y_pred_train)
    train_auc = roc_auc_score(y_train, y_prob_train)

    print(f"--- {name} ---")
    print(f"Train Accuracy: {train_acc:.4f} | Train AUC: {train_auc:.4f}")
    print(f"Test  Accuracy: {test_acc:.4f} | Test  AUC: {test_auc:.4f}")

    if (train_acc - test_acc) > 0.05:
         print("⚠️ Warning: Signs of Overfitting (Train is much better than Test)")
    else:
         print("✅ Model seems balanced")
    print("-" * 30)

    fpr, tpr, _ = roc_curve(y_test, y_prob_test)
    precision, recall, _ = precision_recall_curve(y_test, y_prob_test)

    if graph:
        fig, axes = plt.subplots(1, 2, figsize=(14, 6))

        # --- Plot 1: ROC Curve ---
        axes[0].plot(fpr, tpr, color='darkorange', lw=2, label=f'AUC = {test_auc:.3f}')
        axes[0].plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Guess')
        axes[0].set_xlim([0.0, 1.0])
        axes[0].set_ylim([0.0, 1.05])
        axes[0].set_xlabel('False Positive Rate')
        axes[0].set_ylabel('True Positive Rate')
        axes[0].set_title(f'ROC Curve: {name}')
        axes[0].legend(loc="lower right")
        axes[0].grid(True, alpha=0.3)

        # --- Plot 2: Precision-Recall Curve ---
        # "No Skill" baseline is just the percentage of positive cases
        no_skill = y_test.mean()

        axes[1].plot(recall, precision, color='green', lw=2, label=f'Avg Precision = {test_ap:.3f}')
        axes[1].plot([0, 1], [no_skill, no_skill], color='navy', linestyle='--', label='No Skill')
        axes[1].set_xlim([0.0, 1.0])
        axes[1].set_ylim([0.0, 1.05])
        axes[1].set_xlabel('Recall')
        axes[1].set_ylabel('Precision')
        axes[1].set_title(f'PR Curve: {name}')
        axes[1].legend(loc="upper right")
        axes[1].grid(True, alpha=0.3)

        plt.tight_layout()
        plt.show()

    return {
        "model": name,
        "accuracy": test_acc,
        "auc": test_auc,
        "average_precision": test_ap,
        "train_accuracy": train_acc,
        "train_auc": train_auc
    }

def pick_best_model_clf(dataframe, overfit_threshold=0.05):
    if dataframe is None or dataframe.empty:
        print("No models provided (empty DataFrame).")
        return None

    df = dataframe.copy()

    df['overfitting_gap'] = df['train_auc'] - df['auc']
    df['is_overfit'] = df['overfitting_gap'] > overfit_threshold

    df_valid = df[df['is_overfit'] == False].copy()
    df_overfit = df[df['is_overfit'] == True].copy()

    print(f"Total Models: {len(df)}")
    print(f"Valid Models: {len(df_valid)}")
    print(f"Disqualified Models: {len(df_overfit)}")

    if not df_overfit.empty:
        print("\n⚠️ The following models were disqualified due to overfitting:")
        display(df_overfit[['model', 'accuracy', 'auc', 'train_auc', 'overfitting_gap']])
    else:
        print("\n✅ No models were disqualified for overfitting.")

    if df_valid.empty:
        print("\n⚠️ All models are overfitting (by the chosen threshold).")
        print("Falling back to ranking ALL models by AUC.")
        df_valid = df.copy()

    print("Best by Accuracy:")
    display(df_valid.sort_values(by="accuracy", ascending=False).head(1))

    print("Best by AUC:")
    display(df_valid.sort_values(by="auc", ascending=False).head(1))

    print("Best by Average Precision:")
    display(df_valid.sort_values(by="average_precision", ascending=False).head(1))

    df_valid_ranked = df_valid.copy()
    df_valid_ranked["rank_acc"] = df_valid["accuracy"].rank(ascending=False)
    df_valid_ranked["rank_auc"] = df_valid["auc"].rank(ascending=False)
    df_valid_ranked["rank_ap"] = df_valid["average_precision"].rank(ascending=False)


    final_ranking = df_valid_ranked.sort_values(
        by=["rank_auc", "rank_acc", "rank_ap"]
    )

    cols = ['model', 'accuracy', 'auc', 'average_precision', 'overfitting_gap']
    print("\nFinal ranking (higher = better):")
    display(final_ranking[cols])

    best_model_row = final_ranking.iloc[0]
    best_model_name = best_model_row['model']
    print(f"\n🏆 Best model: {best_model_name}")

    return best_model_name

Bayes Classifier¶

In [304]:
from sklearn.naive_bayes import GaussianNB, BernoulliNB

results_clf = []

gnb = GaussianNB()
results_clf.append(
    evaluate_and_graph_clf(gnb, X_train_full, y_train, X_test_full, y_test, "Naive Bayes (Gaussian)", True)
)

bnb = BernoulliNB()
results_clf.append(
    evaluate_and_graph_clf(bnb, X_train_full, y_train, X_test_full, y_test, "Naive Bayes (Bernoulli)", True)
)
--- Naive Bayes (Gaussian) ---
Train Accuracy: 0.6769 | Train AUC: 0.7383
Test  Accuracy: 0.6697 | Test  AUC: 0.7341
✅ Model seems balanced
------------------------------
No description has been provided for this image
--- Naive Bayes (Bernoulli) ---
Train Accuracy: 0.6601 | Train AUC: 0.7182
Test  Accuracy: 0.6527 | Test  AUC: 0.7177
✅ Model seems balanced
------------------------------
No description has been provided for this image

The Gaussian Naive Bayes model is a moderately effective classifier for predicting Disease Status. With a test Accuracy of $67\%$ and a test AUC of $0.734$, the model is stable (no overfitting) and demonstrates useful predictive power over random guessing.

The Bernoulli Naive Bayes model is a moderately effective classifier for predicting Disease Status. With a test Accuracy of $65\%$ and a test AUC of $0.718$, the model is stable (no overfitting) and demonstrates useful predictive power over random guessing.

The Gaussian Naive Bayes model is the preferred choice because the core predictive features in this dataset—the Polygenic Score, Environmental Index, and Quantitative Trait—all exhibit Standard Normal Distributions. Since Gaussian Naive Bayes explicitly assumes a normal (Gaussian) distribution for continuous features, it capitalizes on the true underlying structure of the data, which leads to higher accuracy ($67\%$) and better discriminatory power (AUC $0.734$) compared to the Bernoulli model.
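This reasoning can be demonstrated directly: when features really are class-conditionally Gaussian, GaussianNB outperforms BernoulliNB, which must first binarize each continuous input (at 0 by default). A small synthetic sketch, not the cohort data:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, GaussianNB

# Class-conditional Gaussian features: GaussianNB's likelihood model is
# exactly right here, while BernoulliNB discards information by
# thresholding each feature at 0.
rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, n)
X = rng.normal(loc=0.8 * y[:, None], size=(n, 3))  # class means 0 vs 0.8

scores = {}
for clf in (GaussianNB(), BernoulliNB()):
    scores[type(clf).__name__] = cross_val_score(
        clf, X, y, cv=5, scoring="roc_auc"
    ).mean()
print(scores)
```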

K-Nearest Neighbors¶

In [305]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=11)
results_clf.append(
    evaluate_and_graph_clf(knn, X_train_full, y_train, X_test_full, y_test, "KNN (k=11)", True)
)
--- KNN (k=11) ---
Train Accuracy: 0.7101 | Train AUC: 0.7898
Test  Accuracy: 0.6537 | Test  AUC: 0.7143
⚠️ Warning: Signs of Overfitting (Train is much better than Test)
------------------------------
No description has been provided for this image

The K-Nearest Neighbors ($k=11$) model is a moderately effective classifier for predicting Disease Status. With a Test Accuracy of approximately $65\%$ and a Test AUC of $0.714$, the model demonstrates useful predictive power over random guessing (where AUC would be $0.5$). The model exhibits Signs of Overfitting, as explicitly warned in the output: the Train Accuracy ($71.01\%$) is significantly higher than the Test Accuracy ($65.37\%$), indicating the model learned the training data too well and does not generalize as effectively to new, unseen data.
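One way to address this overfitting is to choose $k$ by cross-validation rather than fixing $k=11$ a priori: larger $k$ smooths the decision boundary and narrows the train/test gap. A sketch using synthetic stand-in data in place of X_train_full:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the cohort features (shapes assumed).
X, y = make_classification(n_samples=1500, n_features=7, n_informative=4,
                           random_state=0)

# Cross-validated search over the neighborhood size.
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [5, 11, 21, 41, 81]},
    cv=5,
    scoring="roc_auc",
)
grid.fit(X, y)
print("best k:", grid.best_params_["n_neighbors"],
      "| CV AUC:", round(grid.best_score_, 3))
```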

Logistic Regression¶

In [306]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=5000)
results_clf.append(
    evaluate_and_graph_clf(lr, X_train_full, y_train, X_test_full, y_test, "Logistic Regression", True)
)
--- Logistic Regression ---
Train Accuracy: 0.6844 | Train AUC: 0.7585
Test  Accuracy: 0.6840 | Test  AUC: 0.7579
✅ Model seems balanced
------------------------------
No description has been provided for this image

The Logistic Regression model is a moderately effective classifier for predicting Disease Status. With a Test Accuracy of approximately $68.4\%$ and a Test AUC of $0.758$, the model is stable (no overfitting) and demonstrates useful predictive power over random guessing.

Generative Models (LDA and QDA)¶

In [307]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis()
results_clf.append(
    evaluate_and_graph_clf(lda, X_train_full, y_train, X_test_full, y_test, "LDA", True)
)
--- LDA ---
Train Accuracy: 0.6846 | Train AUC: 0.7584
Test  Accuracy: 0.6853 | Test  AUC: 0.7578
✅ Model seems balanced
------------------------------
No description has been provided for this image

The Linear Discriminant Analysis (LDA) model is a moderately effective classifier for predicting Disease Status. With a Test Accuracy of approximately $68.5\%$ and a Test AUC of $0.758$, the model is stable (no overfitting) and demonstrates useful predictive power over random guessing.

In [308]:
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

qda = QuadraticDiscriminantAnalysis(reg_param=0.1)
results_clf.append(
    evaluate_and_graph_clf(qda, X_train_full, y_train, X_test_full, y_test, "QDA", True)
)
--- QDA ---
Train Accuracy: 0.6821 | Train AUC: 0.7527
Test  Accuracy: 0.6810 | Test  AUC: 0.7482
✅ Model seems balanced
------------------------------
No description has been provided for this image

The Quadratic Discriminant Analysis (QDA) model is a moderately effective classifier for predicting Disease Status. With a Test Accuracy of approximately $68.1\%$ and a Test AUC of $0.748$, the model is stable (no overfitting) and demonstrates useful predictive power over random guessing

Classification Tree¶

In [309]:
from sklearn.tree import DecisionTreeClassifier

dt_gini = DecisionTreeClassifier(criterion='gini', max_depth=10, random_state=SEED)
results_clf.append(
    evaluate_and_graph_clf(dt_gini, X_train_full, y_train, X_test_full, y_test, "Decision Tree (Gini)", True)
)

dt_entropy = DecisionTreeClassifier(criterion='entropy', max_depth=10, random_state=SEED)
results_clf.append(
    evaluate_and_graph_clf(dt_entropy, X_train_full, y_train, X_test_full, y_test, "Decision Tree (Entropy)", True)
)
--- Decision Tree (Gini) ---
Train Accuracy: 0.7737 | Train AUC: 0.8671
Test  Accuracy: 0.6383 | Test  AUC: 0.6678
⚠️ Warning: Signs of Overfitting (Train is much better than Test)
------------------------------
No description has been provided for this image
--- Decision Tree (Entropy) ---
Train Accuracy: 0.7603 | Train AUC: 0.8543
Test  Accuracy: 0.6593 | Test  AUC: 0.6885
⚠️ Warning: Signs of Overfitting (Train is much better than Test)
------------------------------
No description has been provided for this image

The Decision Tree (Gini) model is a poor to moderately effective classifier for predicting Disease Status. With a Test Accuracy of approximately $63.8\%$ and a Test AUC of $0.668$, the model demonstrates predictive power that is only slightly better than a random guess (AUC $0.5$). The model exhibits Signs of Overfitting, as explicitly warned in the output: the Train Accuracy ($77.37\%$) is significantly higher than the Test Accuracy ($63.83\%$), indicating the model learned the training data too well and does not generalize as effectively to new, unseen data.

The Decision Tree (Entropy) model is a poor to moderately effective classifier for predicting Disease Status. With a Test Accuracy of approximately $65.9\%$ and a Test AUC of $0.688$, the model demonstrates predictive power that is only slightly better than a random guess (AUC $0.5$). The model exhibits Signs of Overfitting, as explicitly warned in the output: the Train Accuracy ($76.03\%$) is significantly higher than the Test Accuracy ($65.93\%$), indicating the model learned the training data too well and does not generalize as effectively to new, unseen data.

Both models have overfit the training data and should not be used unless their hyperparameters (for example, max_depth) are tuned and re-validated to prevent overfitting.
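A simple way to do that tuning is to pick max_depth by cross-validated AUC instead of fixing it at 10. A sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cohort features (shapes assumed).
X, y = make_classification(n_samples=1500, n_features=7, n_informative=4,
                           random_state=0)

# Score each candidate depth on held-out folds; shallower trees
# generally win when deeper ones start memorizing noise.
cv_auc = {}
for depth in (2, 3, 4, 6, 10, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    cv_auc[depth] = cross_val_score(tree, X, y, cv=5, scoring="roc_auc").mean()

best = max(cv_auc, key=cv_auc.get)
print({d: round(a, 3) for d, a in cv_auc.items()}, "| best:", best)
```

Cost-complexity pruning (`ccp_alpha`) is an alternative knob serving the same purpose.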

Ensemble Methods (Bagging, Random Forest, Gradient Boosting)¶

In [310]:
from sklearn.ensemble import BaggingClassifier

bagging = BaggingClassifier(n_estimators=50, random_state=SEED)
results_clf.append(
    evaluate_and_graph_clf(bagging, X_train_full, y_train, X_test_full, y_test, "Bagging (Trees)", True)
)
--- Bagging (Trees) ---
Train Accuracy: 0.9999 | Train AUC: 1.0000
Test  Accuracy: 0.6653 | Test  AUC: 0.7261
⚠️ Warning: Signs of Overfitting (Train is much better than Test)
------------------------------
No description has been provided for this image

The Bagging Classifier (using Decision Trees) is a moderately effective classifier for predicting Disease Status. With a Test Accuracy of approximately $66.5\%$ and a Test AUC of $0.726$, the model demonstrates useful predictive power over random guessing (AUC $0.5$). The model exhibits Signs of Severe Overfitting, as explicitly warned in the output: the Train Accuracy ($99.99\%$) is far higher than the Test Accuracy ($66.53\%$), indicating the model effectively memorized the training data and does not generalize to new, unseen data.
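Because bagging resamples with replacement, each tree leaves out roughly a third of the training rows; enabling `oob_score` scores each row with only the trees that never saw it, giving an honest generalization estimate despite the near-perfect train accuracy. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Synthetic stand-in for the cohort features (shapes assumed).
X, y = make_classification(n_samples=2000, n_features=7, n_informative=4,
                           random_state=0)

# Each tree fits its bootstrap sample almost perfectly (hence ~100%
# train accuracy); the out-of-bag score estimates test performance.
bag = BaggingClassifier(n_estimators=50, oob_score=True, random_state=0)
bag.fit(X, y)
print(f"train accuracy: {bag.score(X, y):.3f} | "
      f"OOB accuracy: {bag.oob_score_:.3f}")
```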

In [311]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=SEED)
results_clf.append(
    evaluate_and_graph_clf(rf, X_train_full, y_train, X_test_full, y_test, "Random Forest", True)
)
--- Random Forest ---
Train Accuracy: 0.8233 | Train AUC: 0.9203
Test  Accuracy: 0.6790 | Test  AUC: 0.7457
⚠️ Warning: Signs of Overfitting (Train is much better than Test)
------------------------------
No description has been provided for this image

The Random Forest model is a moderately effective classifier for predicting Disease Status. With a Test Accuracy of approximately $68\%$ and a Test AUC of $0.746$, the model demonstrates useful predictive power over random guessing (AUC $0.5$). The model exhibits Signs of Overfitting, as explicitly warned in the output: the Train Accuracy ($82.33\%$) is significantly higher than the Test Accuracy ($67.90\%$), indicating the model learned the training data too well and does not generalize as effectively to new, unseen data.

In [312]:
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=SEED)
results_clf.append(
    evaluate_and_graph_clf(gb, X_train_full, y_train, X_test_full, y_test, "Gradient Boosting", True)
)
--- Gradient Boosting ---
Train Accuracy: 0.7164 | Train AUC: 0.7952
Test  Accuracy: 0.6837 | Test  AUC: 0.7497
✅ Model seems balanced
------------------------------
No description has been provided for this image

The Gradient Boosting Classifier model is a moderately effective classifier for predicting Disease Status. With a Test Accuracy of approximately $68.4\%$ and a Test AUC of $0.750$, the model is stable (no overfitting) and demonstrates useful predictive power over random guessing.

SVC¶

In [313]:
from sklearn.svm import SVC

svm_linear = SVC(kernel='linear', probability=True, random_state=SEED)
results_clf.append(
    evaluate_and_graph_clf(svm_linear, X_train_full, y_train, X_test_full, y_test, "SVM (Linear)", True)
)

svm_poly = SVC(kernel='poly', degree=3, probability=True, random_state=SEED)
results_clf.append(
    evaluate_and_graph_clf(svm_poly, X_train_full, y_train, X_test_full, y_test, "SVM (Polynomial)", True)
)
svm_rbf = SVC(kernel='rbf', probability=True, random_state=SEED)
results_clf.append(
    evaluate_and_graph_clf(svm_rbf, X_train_full, y_train, X_test_full, y_test, "SVM (RBF)", True)
)
--- SVM (Linear) ---
Train Accuracy: 0.6870 | Train AUC: 0.7585
Test  Accuracy: 0.6847 | Test  AUC: 0.7577
✅ Model seems balanced
------------------------------
No description has been provided for this image
--- SVM (Polynomial) ---
Train Accuracy: 0.6837 | Train AUC: 0.7578
Test  Accuracy: 0.6730 | Test  AUC: 0.7372
✅ Model seems balanced
------------------------------
No description has been provided for this image
--- SVM (RBF) ---
Train Accuracy: 0.6949 | Train AUC: 0.7667
Test  Accuracy: 0.6797 | Test  AUC: 0.7423
✅ Model seems balanced
------------------------------
No description has been provided for this image

The Linear SVM model is a moderately effective classifier for predicting Disease Status. With a Test Accuracy of approximately $68.4\%$ and a Test AUC of $0.758$, the model is stable (no overfitting) and demonstrates useful predictive power over random guessing.

The Polynomial SVM model is a moderately effective classifier for predicting Disease Status. With a Test Accuracy of approximately $67.3\%$ and a Test AUC of $0.737$, the model is stable (no overfitting) and demonstrates useful predictive power over random guessing.

The RBF SVM model is a moderately effective classifier for predicting Disease Status. With a Test Accuracy of approximately $68.0\%$ and a Test AUC of $0.742$, the model is stable (no overfitting) and demonstrates useful predictive power over random guessing.

The Linear Support Vector Machine (SVM) model is one of the top-performing and most stable classifiers for predicting Disease Status, achieving the highest Test Accuracy ($\approx 68.4\%$) and Test AUC ($0.758$), demonstrating that a simple linear decision boundary is highly effective for separating the two disease status groups in this dataset.

Results Classification¶

In [314]:
df_clf = pd.DataFrame(results_clf)
pick_best_model_clf(df_clf, overfit_threshold=0.05)
Total Models: 14
Valid Models: 9
Disqualified Models: 5

⚠️ The following models were disqualified due to overfitting:
model accuracy auc train_auc overfitting_gap
2 KNN (k=11) 0.653667 0.714328 0.789760 0.075432
6 Decision Tree (Gini) 0.638333 0.667801 0.867059 0.199258
7 Decision Tree (Entropy) 0.659333 0.688465 0.854251 0.165785
8 Bagging (Trees) 0.665333 0.726099 1.000000 0.273901
9 Random Forest 0.679000 0.745717 0.920325 0.174608
Best by Accuracy:
model accuracy auc average_precision train_accuracy train_auc overfitting_gap is_overfit
4 LDA 0.685333 0.757759 0.750591 0.684571 0.758439 0.000681 False
Best by AUC:
model accuracy auc average_precision train_accuracy train_auc overfitting_gap is_overfit
3 Logistic Regression 0.684 0.757874 0.750782 0.684429 0.758455 0.000582 False
Best by Average Precision:
model accuracy auc average_precision train_accuracy train_auc overfitting_gap is_overfit
3 Logistic Regression 0.684 0.757874 0.750782 0.684429 0.758455 0.000582 False
Final ranking (higher = better):
model accuracy auc average_precision overfitting_gap
3 Logistic Regression 0.684000 0.757874 0.750782 0.000582
4 LDA 0.685333 0.757759 0.750591 0.000681
11 SVM (Linear) 0.684667 0.757651 0.750271 0.000848
10 Gradient Boosting 0.683667 0.749663 0.740665 0.045580
5 QDA 0.681000 0.748234 0.733072 0.004476
13 SVM (RBF) 0.679667 0.742300 0.709593 0.024385
12 SVM (Polynomial) 0.673000 0.737244 0.721664 0.020582
0 Naive Bayes (Gaussian) 0.669667 0.734063 0.723929 0.004259
1 Naive Bayes (Bernoulli) 0.652667 0.717664 0.709780 0.000544
🏆 Best model: Logistic Regression
Out[314]:
'Logistic Regression'
  1. The Best Performers: Exploiting the Linear Ground Truth. The Logistic Regression and LDA (Linear Discriminant Analysis) models achieved the highest performance, with nearly identical results:

    • Logistic Regression: Accuracy $0.6840$, AUC $0.7579$, Average Precision $0.7508$, overfitting gap $0.0006$
    • LDA: Accuracy $0.6853$, AUC $0.7578$, Average Precision $0.7506$, overfitting gap $0.0007$

    This is the expected outcome. Since the simulated disease phenotype was derived from a linear logistic function of the predictors, the true optimal decision boundary is linear. Both Logistic Regression and LDA are powerful linear classifiers and were able to optimally exploit this known underlying structure in the data.

  2. Models Penalized for Over-Flexibility or Local Assumptions Models that introduced unnecessary complexity or relied on localized data features performed worse, failing to capture the global linear pattern as efficiently:

    • QDA (Quadratic Discriminant Analysis) performed slightly worse ($\text{Accuracy} = 0.681$, $\text{AUC} \approx 0.748$). Reasoning: QDA estimates a separate covariance matrix for each class, giving it a quadratic decision boundary. When the underlying truth is linear, estimating two matrices introduces estimation noise and unnecessary complexity where one matrix would suffice, resulting in a minor performance hit.

    • SVM (RBF Kernel) achieved moderate performance ($\text{Accuracy} \approx 0.680$, $\text{AUC} \approx 0.742$). Reasoning: The RBF kernel provides nonlinear flexibility to the SVM. Since the true model is linear, this extra flexibility is redundant, offering no advantage over linear models and slightly reducing efficiency.

    • KNN ($\text{k}=11$) had the lowest performance among the evaluated models ($\text{Accuracy} \approx 0.654$, $\text{AUC} \approx 0.714$). Reasoning: As a non-parametric, distance-based classifier, KNN cannot explicitly model or exploit a global linear relationship. It is highly sensitive to local noise and struggles in high-dimensional settings where the signal is weak and smoothly varying, making it a poor choice for this type of genetic data.

Overall, the classification task demonstrates moderate but limited predictive power ($\text{AUC}$ in the $0.75$ range). This is expected given the simulated scenario, where the $\text{disease\_status}$ outcome was constructed to be noisy and only moderately dependent on the genotype predictors. The classification analysis primarily serves to validate model behavior under a known, nearly linear ground truth, illustrating why simpler, interpretable linear models are often superior when they match the data-generating process. This contrasts with the deeper biological and scientific interpretability offered by the regression modeling of the quantitative trait.
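The point about linear models winning under a linear ground truth can be reproduced on a toy simulation (this is a standalone sketch with made-up effect sizes, not the project's cohort): when labels come from a linear logistic model, Logistic Regression should match or beat a quadratic classifier such as QDA.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, p = 4000, 5
X = rng.normal(size=(n, p))
beta = np.array([0.8, -0.5, 0.3, 0.0, 0.0])      # linear ground truth
prob = 1 / (1 + np.exp(-(X @ beta)))             # logistic link
y = rng.binomial(1, prob)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Linear classifier: matches the data-generating process
auc_lr = roc_auc_score(y_te, LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
# Quadratic classifier: extra flexibility buys nothing here
auc_qda = roc_auc_score(y_te, QuadraticDiscriminantAnalysis().fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
print(f"Logistic AUC: {auc_lr:.3f} | QDA AUC: {auc_qda:.3f}")
```

With a linear boundary and ample data, the quadratic model's two estimated covariance matrices add noise without adding signal, mirroring the ranking seen above.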

In [315]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

class_labels = ['Unaffected', 'Affected']

y_pred = lr.predict(X_test_full)

cm = confusion_matrix(y_test, y_pred)

fig, ax = plt.subplots(figsize=(8, 8))
disp = ConfusionMatrixDisplay(
    confusion_matrix=cm,
    display_labels=class_labels
)

disp.plot(cmap=plt.cm.Blues, values_format='d', ax=ax)
plt.title('Confusion Matrix for Best Model', fontsize=16)
Out[315]:
Text(0.5, 1.0, 'Confusion Matrix for Best Model')
[Figure: confusion matrix for Logistic Regression, the best classifier]

Key Metrics and Interpretation¶

The performance is broken down into four essential categories:

  • True Positives (TP): 1047
    • Meaning: These are the individuals who were actually affected by the disease and were correctly predicted as being affected by the model.
  • True Negatives (TN): 1035
    • Meaning: These are the individuals who were actually unaffected by the disease and were correctly predicted as being unaffected.
  • False Positives (FP): 500 (Type I Error)
    • Meaning: These are the individuals who were actually unaffected but were incorrectly predicted as being affected. This is often referred to as a "false alarm."
  • False Negatives (FN): 448 (Type II Error)
    • Meaning: These are the individuals who were actually affected but were incorrectly predicted as being unaffected. This is often the more critical error, as the model missed a true case.

Overall Model Performance¶

From this matrix, we can calculate the core metrics that define the model's performance:

  • Accuracy: The percentage of all samples the model classified correctly. $$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{Total Samples}} = \frac{1047 + 1035}{3030} \approx \mathbf{68.7\%}$$
  • Precision (Predictive Accuracy for 'Affected'): Of all the individuals the model predicted were affected, what percentage were actually affected? $$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} = \frac{1047}{1047 + 500} \approx \mathbf{67.7\%}$$
  • Recall / Sensitivity (Model's Ability to Find All 'Affected' Cases): Of all the individuals who were actually affected, what percentage did the model correctly find? $$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} = \frac{1047}{1047 + 448} \approx \mathbf{70.0\%}$$

The model shows a solid balance, correctly identifying 70% of all true disease cases (recall) while 67.7% of its positive predictions are correct (precision).
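The headline metrics can be recomputed directly from the confusion-matrix counts reported above (TP = 1047, TN = 1035, FP = 500, FN = 448):

```python
# Recompute accuracy, precision, recall, and F1 from the raw counts
tp, tn, fp, fn = 1047, 1035, 500, 448

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # correct over all samples
precision = tp / (tp + fp)                    # correct among positive predictions
recall    = tp / (tp + fn)                    # found among true positives
f1        = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")   # ≈ 0.687
print(f"Precision: {precision:.3f}")  # ≈ 0.677
print(f"Recall:    {recall:.3f}")     # ≈ 0.700
print(f"F1 score:  {f1:.3f}")         # ≈ 0.688
```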

Setup Quant Trait Regression¶

Model Preparation¶

In [316]:
X = cohort[numeric_pcs_features + categorical_features]
y = cohort["quant_trait"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=SEED
)

X_train.shape, X_test.shape
Out[316]:
((7000, 7), (3000, 7))

Standardization¶

In [317]:
scaler = StandardScaler()

X_train_scaled_num = scaler.fit_transform(X_train[numeric_pcs_features])
X_train_scaled = pd.DataFrame(
    X_train_scaled_num,
    columns=numeric_pcs_features,
    index=X_train.index
)

X_test_scaled_num = scaler.transform(X_test[numeric_pcs_features])
X_test_scaled = pd.DataFrame(
    X_test_scaled_num,
    columns=numeric_pcs_features,
    index=X_test.index
)

X_train_scaled["sex"] = X_train["sex"].values
X_test_scaled["sex"] = X_test["sex"].values

X_train_scaled.head()
Out[317]:
age env_index polygenic_score quant_trait PC1 PC2 sex
9069 -0.951245 -0.385407 -1.745874 -1.658782 -1.666482 -1.081947 0
2603 0.416455 0.299589 -0.116855 -0.836093 -1.699625 -0.606496 0
7738 0.553225 -0.651264 0.978902 -0.308846 2.083987 -0.817477 0
1579 1.168690 -1.364159 -0.024322 -0.576059 -0.369216 0.407394 1
5058 -0.130625 0.865069 -0.171785 0.621501 -0.218088 0.147418 0

Split Data Overview¶

In [318]:
# 1) Baseline: PRS only
features_baseline = ["polygenic_score"]

# 2) Full model: PRS + covariates (NO PCs)
features_full = ["age", "env_index", "polygenic_score", "sex"]

# 3) Full + PCA: PRS + covariates + PCs
features_full_pca = features_full + pcs_features
In [319]:
X_train_baseline = X_train_scaled[features_baseline]
X_test_baseline = X_test_scaled[features_baseline]

X_train_baseline.head(), X_train_baseline.mean(), X_train_baseline.std()
Out[319]:
(      polygenic_score
 9069        -1.745874
 2603        -0.116855
 7738         0.978902
 1579        -0.024322
 5058        -0.171785,
 polygenic_score   -1.827110e-17
 dtype: float64,
 polygenic_score    1.000071
 dtype: float64)
In [320]:
X_train_full = X_train_scaled[features_full]
X_test_full = X_test_scaled[features_full]

X_train_full.head(), X_train_full.mean(), X_train_full.std()
Out[320]:
(           age  env_index  polygenic_score  sex
 9069 -0.951245  -0.385407        -1.745874    0
 2603  0.416455   0.299589        -0.116855    0
 7738  0.553225  -0.651264         0.978902    0
 1579  1.168690  -1.364159        -0.024322    1
 5058 -0.130625   0.865069        -0.171785    0,
 age               -8.932537e-17
 env_index         -3.045183e-18
 polygenic_score   -1.827110e-17
 sex                4.994286e-01
 dtype: float64,
 age                1.000071
 env_index          1.000071
 polygenic_score    1.000071
 sex                0.500035
 dtype: float64)
In [321]:
X_train_full_pca = X_train_scaled[features_full_pca]
X_test_full_pca  = X_test_scaled[features_full_pca]

X_train_full_pca.head(), X_train_full_pca.mean(), X_train_full_pca.std()
Out[321]:
(           age  env_index  polygenic_score  sex       PC1       PC2
 9069 -0.951245  -0.385407        -1.745874    0 -1.666482 -1.081947
 2603  0.416455   0.299589        -0.116855    0 -1.699625 -0.606496
 7738  0.553225  -0.651264         0.978902    0  2.083987 -0.817477
 1579  1.168690  -1.364159        -0.024322    1 -0.369216  0.407394
 5058 -0.130625   0.865069        -0.171785    0 -0.218088  0.147418,
 age               -8.932537e-17
 env_index         -3.045183e-18
 polygenic_score   -1.827110e-17
 sex                4.994286e-01
 PC1               -2.131628e-17
 PC2                1.167320e-17
 dtype: float64,
 age                1.000071
 env_index          1.000071
 polygenic_score    1.000071
 sex                0.500035
 PC1                1.000071
 PC2                1.000071
 dtype: float64)

Just checking the data before running any models: the train and test splits must contain the same columns, in the same order.
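That check can be made explicit with a small assertion (toy frames shown here for illustration; in the notebook it would run on the actual X_train/X_test splits):

```python
import pandas as pd

# Toy stand-ins for the real train/test splits
train = pd.DataFrame({"age": [1.0], "sex": [0]})
test  = pd.DataFrame({"age": [2.0], "sex": [1]})

# Both splits must expose identical columns in identical order
assert list(train.columns) == list(test.columns), "train/test column mismatch"
assert train.shape[1] == test.shape[1]
print("train/test column check passed")
```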

Regression Models¶

Helper functions used repeatedly for model evaluation below.

In [322]:
from sklearn.metrics import mean_squared_error, r2_score

def evaluate_and_graph_reg(model, X_train, y_train, X_test, y_test, name, graph):
    model.fit(X_train, y_train)

    # Predictions
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    # --- Metrics ---
    mse_train = mean_squared_error(y_train, y_pred_train)
    mse_test = mean_squared_error(y_test, y_pred_test)

    rmse_train = np.sqrt(mse_train)
    rmse_test = np.sqrt(mse_test)

    r2_train = r2_score(y_train, y_pred_train)
    r2_test = r2_score(y_test, y_pred_test)

    print(f"--- {name} ---")
    print(f"Train RMSE: {rmse_train:.4f} | Train R²: {r2_train:.4f}")
    print(f"Test  RMSE: {rmse_test:.4f} | Test  R²: {r2_test:.4f}")

    if (r2_train - r2_test) > 0.05:
        print("⚠️ Warning: Signs of Overfitting (Train R² much higher than Test R²)")
    else:
        print("✅ Model seems reasonably balanced")
    print("-" * 30)

    # --- Plots (equivalent to plot_full_diagnostics) ---
    if graph:
        residuals = y_test - y_pred_test
        r2 = r2_test

        fig, axs = plt.subplots(2, 2, figsize=(12, 10))

        # 1) Predicted vs True
        axs[0, 0].scatter(y_test, y_pred_test, alpha=0.5)
        slope, intercept = np.polyfit(y_test, y_pred_test, 1)
        axs[0, 0].plot(y_test, slope * y_test + intercept, color="blue", label="Regression Line")
        axs[0, 0].plot(
            [y_test.min(), y_test.max()],
            [y_test.min(), y_test.max()],
            color="red", linestyle="--", label="Identity Line"
        )
        axs[0, 0].set_xlabel("True Values")
        axs[0, 0].set_ylabel("Predicted Values")
        axs[0, 0].set_title(f"Predicted vs True ({name})\n$R^2 = {r2:.3f}$")
        axs[0, 0].legend()

        # 2) Residual Histogram
        sns.histplot(residuals, kde=True, bins=20, ax=axs[0, 1])
        axs[0, 1].set_title("Residual Distribution")
        axs[0, 1].set_xlabel("Residual")
        axs[0, 1].set_ylabel("Count")

        # 3) Residuals vs Fitted
        axs[1, 0].scatter(y_pred_test, residuals, alpha=0.5)
        axs[1, 0].axhline(0, color="red", linestyle="--")
        axs[1, 0].set_xlabel("Fitted Values")
        axs[1, 0].set_ylabel("Residuals")
        axs[1, 0].set_title("Residuals vs Fitted")

        # 4) QQ Plot
        stats.probplot(residuals, dist="norm", plot=axs[1, 1])
        axs[1, 1].set_title("QQ Plot")

        plt.tight_layout()
        plt.show()

    return {
        "model": name,
        "rmse_train": rmse_train,
        "rmse_test": rmse_test,
        "r2_train": r2_train,
        "r2_test": r2_test,
    }

def pick_best_model_reg(dataframe, overfit_threshold=0.05):
    if dataframe is None or dataframe.empty:
        print("No models provided (empty DataFrame).")
        return None

    df = dataframe.copy()

    # Overfitting criterion: big gap in R² between train and test
    df['overfitting_gap'] = df['r2_train'] - df['r2_test']
    df['is_overfit'] = df['overfitting_gap'] > overfit_threshold

    df_valid = df[~df['is_overfit']].copy()
    df_overfit = df[df['is_overfit']].copy()

    print(f"Total Models: {len(df)}")
    print(f"Valid Models: {len(df_valid)}")
    print(f"Disqualified Models: {len(df_overfit)}")

    if not df_overfit.empty:
        print("\n⚠️ Disqualified for overfitting:")
        display(df_overfit[['model', 'r2_test', 'rmse_test', 'r2_train', 'overfitting_gap']])
    else:
        print("\n✅ No models were disqualified for overfitting.")

    if df_valid.empty:
        print("\n⚠️ All models flagged as overfitting; ranking ALL models by R².")
        df_valid = df.copy()

    print("Best by R² (Test Set):")
    display(df_valid.sort_values(by="r2_test", ascending=False).head(1))

    print("Best by RMSE (Test Set):")
    display(df_valid.sort_values(by="rmse_test", ascending=True).head(1))

    # print("Best by MAE:")
    # display(df_valid.sort_values(by="mae", ascending=True).head(1))

    df_valid_ranked = df_valid.copy()
    df_valid_ranked["rank_r2"] = df_valid_ranked["r2_test"].rank(ascending=False)
    df_valid_ranked["rank_rmse"] = df_valid_ranked["rmse_test"].rank(ascending=True)
    # df_valid_ranked["rank_mae"] = df_valid_ranked["mae"].rank(ascending=True)

    final_ranking = df_valid_ranked.sort_values(
        by=["rank_r2", "rank_rmse"] #, "rank_mae"]
    )

    # cols = ['model', 'r2', 'rmse', 'mae', 'overfitting_gap']
    cols = ['model', 'r2_test', 'rmse_test', 'overfitting_gap']
    print("\nFinal ranking (better at top):")
    display(final_ranking[cols])

    best_model_row = final_ranking.iloc[0]
    best_model_name = best_model_row['model']
    print(f"\n🏆 Best regression model: {best_model_name}")

    return best_model_name
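The disqualify-then-rank logic of pick_best_model_reg can be illustrated on toy numbers (hypothetical models A–C, not from this analysis): a large train/test R² gap disqualifies a model, and the survivors are ranked by test R², then test RMSE.

```python
import pandas as pd

# Hypothetical results table: B overfits badly despite a high train R²
df = pd.DataFrame({
    "model": ["A", "B", "C"],
    "r2_train": [0.60, 0.90, 0.58],
    "r2_test":  [0.58, 0.55, 0.57],
    "rmse_test": [0.64, 0.67, 0.65],
})
df["overfitting_gap"] = df["r2_train"] - df["r2_test"]

# Disqualify models whose gap exceeds the 0.05 threshold (B: gap 0.35)
valid = df[df["overfitting_gap"] <= 0.05]

# Rank the rest: higher test R² first, lower test RMSE breaks ties
best = valid.sort_values(["r2_test", "rmse_test"], ascending=[False, True]).iloc[0]
print(best["model"])  # → A
```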

Linear Regression¶

In [323]:
from sklearn.linear_model import LinearRegression

results_reg = []

# 1. Baseline: PRS only
lr_baseline = LinearRegression()
results_reg.append(
    evaluate_and_graph_reg(lr_baseline, X_train_baseline, y_train, X_test_baseline, y_test, "LR Baseline (PRS only)", True)
)

# 2. Full model
lr_full = LinearRegression()
results_reg.append(
    evaluate_and_graph_reg(lr_full, X_train_full, y_train, X_test_full, y_test, "LR Full (PRS + covariates)", True)
)

# 3. Full + PCA
lr_full_pca = LinearRegression()
results_reg.append(
    evaluate_and_graph_reg(lr_full_pca, X_train_full_pca, y_train, X_test_full_pca, y_test, "LR Full + PCA", True)
)
--- LR Baseline (PRS only) ---
Train RMSE: 0.7544 | Train R²: 0.4409
Test  RMSE: 0.7349 | Test  R²: 0.4364
✅ Model seems reasonably balanced
------------------------------
[Figure: diagnostic plots for LR Baseline (PRS only)]
--- LR Full (PRS + covariates) ---
Train RMSE: 0.6646 | Train R²: 0.5660
Test  RMSE: 0.6444 | Test  R²: 0.5666
✅ Model seems reasonably balanced
------------------------------
[Figure: diagnostic plots for LR Full (PRS + covariates)]
--- LR Full + PCA ---
Train RMSE: 0.6645 | Train R²: 0.5661
Test  RMSE: 0.6443 | Test  R²: 0.5667
✅ Model seems reasonably balanced
------------------------------
[Figure: diagnostic plots for LR Full + PCA]

The Linear Regression Baseline model demonstrates a moderately strong linear relationship between the Polygenic Score (PRS) and the True Values, achieving a Test $R^2$ of $0.436$: the PRS alone accounts for about $43.6\%$ of the variance in the trait. Train and test metrics agree closely (Test RMSE $\approx 0.735$), and the diagnostics show no significant bias or non-linearity: the Residuals vs Fitted plot is randomly scattered, the residual distribution is approximately normal, and the QQ plot tracks the theoretical line, consistent with the assumption of normally distributed errors.

The Linear Regression Full Model (PRS + covariates) shows a stronger fit, with a Test $R^2$ of $0.566$: the Polygenic Score and covariates together account for about $56.6\%$ of the trait variance, a clear improvement over the baseline. Train and test metrics again agree closely (Test RMSE $\approx 0.644$), and the diagnostics show no bias or non-linearity: residuals are randomly scattered around zero, approximately normally distributed, and the QQ plot tracks the theoretical line.

Adding the principal components (LR Full + PCA) leaves performance essentially unchanged (Test $R^2 = 0.567$, Test RMSE $\approx 0.644$), with equally clean diagnostics. The PCs contribute almost no additional explained variance beyond the raw covariates.

The Linear Regression Full Model (using PRS and covariates) and the Linear Regression (LR Full + PCA) model achieved the best performance, accounting for approximately $56.6\%$ to $56.7\%$ of the variance in the predicted trait ($R^2 \approx 0.567$), a notable improvement over the PRS-only Baseline model ($R^2 = 0.436$). All three models are statistically stable and meet the necessary assumptions, showing low error and no significant bias.

Poisson Regression¶

In [324]:
print(f"Number of negative y values: {np.sum(y_train < 0)}")

print(f"Minimum value of y_train: {y_train.min()}")
Number of negative y values: 3514
Minimum value of y_train: -3.674438222164647

Because the response contains negative values, Poisson regression, which assumes a non-negative (count-like) outcome, cannot be applied to quant_trait.
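This restriction is enforced by the library itself: the Poisson log-likelihood is only defined for non-negative responses, and scikit-learn's PoissonRegressor rejects any negative target. A minimal sketch (toy arrays, not the cohort data):

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

X = np.array([[0.0], [1.0], [2.0]])
y = np.array([1.0, -0.5, 2.0])   # one negative value, as in quant_trait

msg = ""
try:
    PoissonRegressor().fit(X, y)  # raises: Poisson requires y >= 0
except ValueError as e:
    msg = str(e)
print("PoissonRegressor rejected the data:", msg)
```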

Subset Selection¶

In [325]:
from sklearn.feature_selection import RFE
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

# --- Configuration ---
base_estimator = LinearRegression()
features_to_select = 4

# --- 1. RFE on Full Features ---
rfe_full = RFE(
    estimator=base_estimator,
    n_features_to_select=features_to_select
)

results_reg.append(
    evaluate_and_graph_reg(rfe_full, X_train_full, y_train, X_test_full, y_test, f"RFE Subset Selection (Full, {features_to_select} feats)", True)
)

# # --- 2. RFE on Full + PCA Features ---
pca_components = 4
rfe_pca_select = 4

rfe_full_pca = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=pca_components)),
    ('rfe', RFE(
        estimator=base_estimator,
        n_features_to_select=rfe_pca_select
    ))
])

results_reg.append(
    evaluate_and_graph_reg(rfe_full_pca, X_train_full_pca, y_train, X_test_full_pca, y_test, f"RFE Subset (Full + PCA, selected {rfe_pca_select} components)", True)
)
--- RFE Subset Selection (Full, 4 feats) ---
Train RMSE: 0.6646 | Train R²: 0.5660
Test  RMSE: 0.6444 | Test  R²: 0.5666
✅ Model seems reasonably balanced
------------------------------
[Figure: diagnostic plots for RFE Subset Selection (Full, 4 feats)]
--- RFE Subset (Full + PCA, selected 4 components) ---
Train RMSE: 0.7299 | Train R²: 0.4765
Test  RMSE: 0.7120 | Test  R²: 0.4709
✅ Model seems reasonably balanced
------------------------------
[Figure: diagnostic plots for RFE Subset (Full + PCA)]

Because the full feature set contains only four predictors, RFE with n_features_to_select=4 retains all of them, so the RFE Subset Selection (Full, 4 feats) model is identical to the full linear regression: Test $R^2 = 0.567$, Test RMSE $\approx 0.644$, with clean diagnostics (randomly scattered residuals, an approximately normal residual distribution, and a QQ plot that tracks the theoretical line).

The RFE Subset (Full + PCA) pipeline performs noticeably worse (Test $R^2 = 0.471$, Test RMSE $\approx 0.712$). Here the six predictors are first compressed into four principal components before selection, and that compression discards some trait-relevant signal, costing roughly ten points of $R^2$ relative to the raw-feature model.

Taken together, RFE on the raw features simply reproduces the full model, while selecting among PCA components sacrifices accuracy. For this dataset, feature reduction is better performed on the raw covariates than on derived components.
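After fitting, an RFE object exposes which columns survived via its support_ and ranking_ attributes. A standalone sketch on toy data (columns a–d are hypothetical; only a and b carry signal):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
y = 2.0 * X["a"] - 1.0 * X["b"] + rng.normal(scale=0.1, size=200)

# Eliminate one feature per step until two remain
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2).fit(X, y)
kept = list(X.columns[rfe.support_])
print(kept)  # → ['a', 'b']: the informative columns survive
```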

Forward and Backward Selections¶

In [326]:
import statsmodels.api as sm

def forward_selection(X, y, criterion="aic"):
    remaining = list(X.columns)
    selected = []
    current_score, best_new_score = np.inf, np.inf

    while remaining:
        scores_with_candidates = []

        for candidate in remaining:
            formula_vars = selected + [candidate]
            X_model = sm.add_constant(X[formula_vars])
            model = sm.OLS(y, X_model).fit()

            if criterion == "aic":
                score = model.aic
            else:
                score = model.bic

            scores_with_candidates.append((score, candidate))

        scores_with_candidates.sort()
        best_new_score, best_candidate = scores_with_candidates[0]

        if best_new_score < current_score:
            remaining.remove(best_candidate)
            selected.append(best_candidate)
            current_score = best_new_score
        else:
            break

    return selected, current_score

def backward_selection(X, y, criterion="aic"):
    selected = list(X.columns)
    # NOTE: current_score starts at inf, so the first elimination is always
    # accepted without ever scoring the full model; with a single predictor
    # this collapses the selection to an intercept-only (empty) model.
    current_score = np.inf

    while len(selected) > 0:
        scores_with_candidates = []

        for candidate in selected:
            trial = selected.copy()
            trial.remove(candidate)

            X_model = sm.add_constant(X[trial])
            model = sm.OLS(y, X_model).fit()

            if criterion == "aic":
                score = model.aic
            else:
                score = model.bic

            scores_with_candidates.append((score, candidate))

        scores_with_candidates.sort()
        best_new_score, worst_candidate = scores_with_candidates[0]

        if best_new_score < current_score:
            selected.remove(worst_candidate)
            current_score = best_new_score
        else:
            break

    return selected, current_score

X_train_baseline_copy = X_train_baseline.copy()
X_test_baseline_copy = X_test_baseline.copy()

X_train_full_copy = X_train_full.copy()
X_test_full_copy = X_test_full.copy()

X_train_full_pca_copy = X_train_full_pca.copy()
X_test_full_pca_copy = X_test_full_pca.copy()

y_train_copy = y_train.copy()
y_test_copy = y_test.copy()
In [327]:
forward_aic_vars_baseline, forward_aic_score_baseline = forward_selection(X_train_baseline_copy, y_train_copy, criterion="aic")
backward_aic_vars_baseline, backward_aic_score_baseline = backward_selection(X_train_baseline_copy, y_train_copy, criterion="aic")

print(forward_aic_vars_baseline, backward_aic_vars_baseline)

forward_bic_vars_baseline, forward_bic_score_baseline = forward_selection(X_train_baseline_copy, y_train_copy, criterion="bic")
backward_bic_vars_baseline, backward_bic_score_baseline = backward_selection(X_train_baseline_copy, y_train_copy, criterion="bic")

print(forward_bic_vars_baseline, backward_bic_vars_baseline)
['polygenic_score'] []
['polygenic_score'] []
In [328]:
forward_aic_vars_full, forward_aic_score_full = forward_selection(X_train_full_copy, y_train_copy, criterion="aic")
backward_aic_vars_full, backward_aic_score_full = backward_selection(X_train_full_copy, y_train_copy, criterion="aic")

print(forward_aic_vars_full, backward_aic_vars_full)

forward_bic_vars_full, forward_bic_score_full = forward_selection(X_train_full_copy, y_train_copy, criterion="bic")
backward_bic_vars_full, backward_bic_score_full = backward_selection(X_train_full_copy, y_train_copy, criterion="bic")

print(forward_bic_vars_full, backward_bic_vars_full)
['polygenic_score', 'env_index', 'sex'] ['env_index', 'polygenic_score', 'sex']
['polygenic_score', 'env_index', 'sex'] ['env_index', 'polygenic_score', 'sex']
In [329]:
forward_aic_vars_full_pca, forward_aic_score_full_pca = forward_selection(X_train_full_pca_copy, y_train_copy, criterion="aic")
backward_aic_vars_full_pca, backward_aic_score_full_pca = backward_selection(X_train_full_pca_copy, y_train_copy, criterion="aic")

print(forward_aic_vars_full_pca, backward_aic_vars_full_pca)

forward_bic_vars_full_pca, forward_bic_score_full_pca = forward_selection(X_train_full_pca_copy, y_train_copy, criterion="bic")
backward_bic_vars_full_pca, backward_bic_score_full_pca = backward_selection(X_train_full_pca_copy, y_train_copy, criterion="bic")

print(forward_bic_vars_full_pca, backward_bic_vars_full_pca)
['polygenic_score', 'env_index', 'sex'] ['env_index', 'polygenic_score', 'sex']
['polygenic_score', 'env_index', 'sex'] ['env_index', 'polygenic_score', 'sex']

Sanity-checking which variables each search selected before fitting the corresponding models. Backward selection on the baseline returned an empty set, while both searches on the full and full + PCA sets agree on polygenic_score, env_index, and sex (age and the PCs are dropped).
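The AIC criterion driving these searches can be checked by hand. Under the Gaussian OLS log-likelihood (the same quantity statsmodels reports, up to parameter-counting conventions), adding an informative predictor should lower AIC relative to the intercept-only model. A self-contained sketch with a simulated PRS-like predictor:

```python
import numpy as np

def ols_aic(X, y):
    """AIC = -2*loglik + 2*k for an OLS fit; X must include the intercept column."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / n                              # MLE residual variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)    # Gaussian log-likelihood
    return -2 * loglik + 2 * k

rng = np.random.default_rng(0)
n = 500
prs = rng.normal(size=n)
y = 0.7 * prs + rng.normal(size=n)      # predictor carries real signal

a0 = ols_aic(np.ones((n, 1)), y)                      # intercept only
a1 = ols_aic(np.column_stack([np.ones(n), prs]), y)   # intercept + predictor
print(a0 > a1)  # → True: the informative predictor lowers AIC
```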

In [330]:
# Fit forward AIC model
lr_fwd_aic_baseline = LinearRegression()
results_reg.append(
    evaluate_and_graph_reg(lr_fwd_aic_baseline, X_train_full[forward_aic_vars_baseline], y_train, X_test_full[forward_aic_vars_baseline], y_test, f"Forward AIC Baseline: {forward_aic_vars_baseline}", True)
)
--- Forward AIC Baseline: ['polygenic_score'] ---
Train RMSE: 0.7544 | Train R²: 0.4409
Test  RMSE: 0.7349 | Test  R²: 0.4364
✅ Model seems reasonably balanced
------------------------------
[Figure: diagnostic plots for Forward AIC Baseline]

Because the baseline contains only one variable, only the forward model needs to be fit: forward selection under both AIC and BIC kept polygenic_score, while backward selection eliminated it and returned an intercept-only model (a consequence of how the elimination loop is initialized).

The Forward AIC Subset Selection (Baseline) Linear Regression model demonstrates a moderately strong linear relationship between the selected predictor ($\text{polygenic\_score}$) and the True Values, achieving a Test $R^2$ of $0.436$: the Polygenic Score alone accounts for about $43.6\%$ of the trait variance. Train and test metrics agree closely (Test RMSE $\approx 0.735$), and the diagnostics show no significant bias or non-linearity: residuals are randomly scattered around zero, approximately normally distributed, and the QQ plot tracks the theoretical line.

In [331]:
# Fit forward AIC model
lr_fwd_aic_full = LinearRegression()
results_reg.append(
    evaluate_and_graph_reg(lr_fwd_aic_full, X_train_full[forward_aic_vars_full], y_train, X_test_full[forward_aic_vars_full], y_test, f"Forward AIC Full: {forward_aic_vars_full}", True)
)

# Fit backward AIC model
lr_bwd_aic_full= LinearRegression()
if backward_aic_vars_full:
    results_reg.append(
        evaluate_and_graph_reg(lr_bwd_aic_full, X_train_full[backward_aic_vars_full], y_train, X_test_full[backward_aic_vars_full], y_test, f"Backward AIC Full: {backward_aic_vars_full}", True)
    )
else:
    print(f"--- Backward AIC Full: Skipped (AIC recommended intercept-only model) ---")
    results_reg.append({
        "model": f"Backward AIC Full: Intercept Only",
        "rmse_train": np.nan,
        "rmse_test": np.nan,
        "r2_train": np.nan,
        "r2_test": np.nan,
    })
--- Forward AIC Full: ['polygenic_score', 'env_index', 'sex'] ---
Train RMSE: 0.6647 | Train R²: 0.5659
Test  RMSE: 0.6446 | Test  R²: 0.5663
✅ Model seems reasonably balanced
------------------------------
[Figure: diagnostic plots for Forward AIC Full]
--- Backward AIC Full: ['env_index', 'polygenic_score', 'sex'] ---
Train RMSE: 0.6647 | Train R²: 0.5659
Test  RMSE: 0.6446 | Test  R²: 0.5663
✅ Model seems reasonably balanced
------------------------------
[Figure: diagnostic plots for Backward AIC Full]

The Forward and Backward AIC Subset Selection (Full) searches selected the same three predictors ($\text{polygenic\_score}$, $\text{env\_index}$, $\text{sex}$), so the two fitted models are identical: Test $R^2 = 0.566$, Test RMSE $\approx 0.645$. Diagnostics are clean in both cases: residuals randomly scattered around zero, approximately normally distributed, and a QQ plot that tracks the theoretical line.

Dropping $\text{age}$ costs essentially nothing relative to the full model (Test $R^2 = 0.567$), confirming that AIC identified a slightly more parsimonious model with equivalent predictive performance.

In [332]:
# Fit forward BIC model
lr_fwd_bic_full_pca = LinearRegression()
results_reg.append(
    evaluate_and_graph_reg(lr_fwd_bic_full_pca, X_train_full[forward_bic_vars_full_pca], y_train, X_test_full[forward_bic_vars_full_pca], y_test, f"Forward BIC Full PCA: {forward_bic_vars_full_pca}", True)
)

# Fit backward BIC model
lr_bwd_bic_full_pca = LinearRegression()
if backward_bic_vars_full_pca:
    results_reg.append(
        evaluate_and_graph_reg(lr_bwd_bic_full_pca, X_train_full[backward_bic_vars_full_pca], y_train, X_test_full[backward_bic_vars_full_pca], y_test, f"Backward BIC Full PCA: {backward_bic_vars_full_pca}", True)
    )
else:
    print(f"--- Backward BIC Full PCA: Skipped (BIC recommended intercept-only model) ---")
    results_reg.append({
        "model": f"Backward BIC Full PCA: Intercept Only",
        "rmse_train": np.nan,
        "rmse_test": np.nan,
        "r2_train": np.nan,
        "r2_test": np.nan,
    })
--- Forward BIC Full PCA: ['polygenic_score', 'env_index', 'sex'] ---
Train RMSE: 0.6647 | Train R²: 0.5659
Test  RMSE: 0.6446 | Test  R²: 0.5663
✅ Model seems reasonably balanced
------------------------------
[Figure: regression diagnostic plots]
--- Backward BIC Full PCA: ['env_index', 'polygenic_score', 'sex'] ---
Train RMSE: 0.6647 | Train R²: 0.5659
Test  RMSE: 0.6446 | Test  R²: 0.5663
✅ Model seems reasonably balanced
------------------------------
[Figure: regression diagnostic plots]

The Forward and Backward BIC subset selection (Full + PCA) models each selected the same three predictors (polygenic_score, env_index, and sex; no principal components were retained) and therefore achieve identical performance: a Test $R^2$ of $0.566$, meaning the BIC-selected features account for approximately $56.6\%$ of the variance in the trait, with low error ($\text{Test RMSE} \approx 0.645$). The diagnostics show no notable bias or non-linearity: the Residuals vs Fitted plot is randomly scattered around zero, the residual distribution is approximately normal, and the QQ plot shows the residuals closely following the theoretical line, consistent with the normal-errors assumption.

As with the AIC methods, the forward and backward BIC searches converged on functionally identical feature subsets, producing statistically stable, unbiased linear models with the same performance.
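Stepwise selection with an information criterion can be sketched directly: for a Gaussian linear model, $\text{BIC} = n\log(\mathrm{RSS}/n) + k\log n$, and forward selection greedily adds whichever feature most lowers it. The following is a minimal, self-contained illustration on synthetic data (the feature names prs, env, noise1, noise2 are invented here; this is not the notebook's evaluate_and_graph_reg pipeline):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def bic(X, y):
    # BIC for an OLS fit: n*log(RSS/n) + k*log(n), with k counting the intercept.
    n = len(y)
    fit = LinearRegression().fit(X, y)
    rss = float(np.sum((y - fit.predict(X)) ** 2))
    k = X.shape[1] + 1
    return n * np.log(rss / n) + k * np.log(n)

def forward_select_bic(X, y, names):
    # Greedily add whichever feature most lowers BIC; stop when nothing helps.
    selected, remaining, best = [], list(names), np.inf
    while remaining:
        cand = [(bic(X[:, [names.index(h) for h in selected + [g]]], y), g)
                for g in remaining]
        score, feat = min(cand)
        if score >= best:
            break
        best = score
        selected.append(feat)
        remaining.remove(feat)
    return selected

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 4))  # columns: two signal features, two pure noise
y = 0.7 * X[:, 0] + 0.35 * X[:, 1] + rng.normal(scale=0.5, size=n)
sel = forward_select_bic(X, y, ["prs", "env", "noise1", "noise2"])
print(sel)
```

Because BIC's $\log n$ penalty is heavier than AIC's constant 2, BIC tends to stop earlier and keep the noise columns out, mirroring the sparse subsets selected above.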

Shrinkage Methods (Ridge, Lasso, Elastic Net)¶

In [333]:
from sklearn.linear_model import RidgeCV

alphas = np.logspace(-4, 4, 200)

ridge_cv = RidgeCV(alphas=alphas)
ridge_cv.fit(X_train_full, y_train)

ridge_alpha = ridge_cv.alpha_

results_ridge = evaluate_and_graph_reg(ridge_cv, X_train_full, y_train, X_test_full, y_test, f"Ridge (alpha={ridge_alpha:.4f})", True)
results_reg.append(results_ridge)
--- Ridge (alpha=4.6059) ---
Train RMSE: 0.6646 | Train R²: 0.5660
Test  RMSE: 0.6444 | Test  R²: 0.5666
✅ Model seems reasonably balanced
------------------------------
[Figure: regression diagnostic plots]

The Ridge regression model demonstrates a strong linear relationship between the predictors and the true values, achieving a Test $R^2$ of $0.566$ (the model accounts for approximately $56.6\%$ of the variance in the trait). It is stable with low error ($\text{Test RMSE} \approx 0.644$) and shows no notable bias or non-linearity: the Residuals vs Fitted plot is randomly scattered around zero, the residual distribution is approximately normal, and the QQ plot closely follows the theoretical line. The $\alpha$ penalty additionally helps mitigate the multicollinearity (VIF $> 10$) observed earlier in the original features.
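For reference, the VIF figure cited above is computed by regressing each feature on the remaining ones: $\mathrm{VIF}_j = 1/(1 - R_j^2)$. A minimal sketch on synthetic data (the near-collinear columns a and b are invented for illustration, not the notebook's features):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X):
    # VIF_j = 1 / (1 - R^2_j), where R^2_j regresses column j on the others.
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(1)
a = rng.normal(size=1000)
b = a + rng.normal(scale=0.1, size=1000)  # nearly collinear with a
c = rng.normal(size=1000)                 # independent of both
v = vif(np.column_stack([a, b, c]))
print(v.round(1))
```

The two collinear columns produce VIFs far above the common cutoff of 10, while the independent column stays near 1; this is the situation the ridge penalty is meant to stabilize.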

In [334]:
from sklearn.linear_model import LassoCV

lasso_cv = LassoCV(
    cv=10,
    max_iter=5000,
    random_state=SEED
)
lasso_cv.fit(X_train_full, y_train)

lasso_alpha = lasso_cv.alpha_

results_lasso = evaluate_and_graph_reg(lasso_cv, X_train_full, y_train, X_test_full, y_test, f"Lasso (alpha={lasso_alpha:.4f})", True)
results_reg.append(results_lasso)
--- Lasso (alpha=0.0007) ---
Train RMSE: 0.6646 | Train R²: 0.5660
Test  RMSE: 0.6444 | Test  R²: 0.5666
✅ Model seems reasonably balanced
------------------------------
[Figure: regression diagnostic plots]

The Lasso regression model demonstrates a strong linear relationship between the predictors and the true values, achieving a Test $R^2$ of $0.566$ (the model accounts for approximately $56.6\%$ of the variance in the trait). It is stable with low error ($\text{Test RMSE} \approx 0.644$) and shows no notable bias or non-linearity: the Residuals vs Fitted plot is randomly scattered around zero, the residual distribution is approximately normal, and the QQ plot closely follows the theoretical line. The L1 penalty mitigates multicollinearity (VIF $> 10$) and can perform feature selection by driving coefficients to exactly zero, although at the very small selected $\alpha = 0.0007$ none of the four coefficients were in fact zeroed (see the coefficient table below).
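Lasso's selection behavior is easiest to see on synthetic data where the true support is known. A minimal sketch (here $\alpha = 0.1$ is fixed for illustration, unlike the CV-selected $\alpha$ above, so the zeroing effect is visible):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 400, 10
X = rng.normal(size=(n, p))
# Only the first two columns carry signal; the remaining eight are pure noise.
y = 1.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)   # alpha fixed (not CV-tuned) for illustration
n_zero = int(np.sum(lasso.coef_ == 0.0))
print(lasso.coef_.round(2), "zeroed:", n_zero)
```

With a sufficiently large penalty the noise coefficients are set exactly to zero while the signal coefficients survive (slightly shrunk); the notebook's tiny CV-selected $\alpha$ explains why all four of its coefficients remained nonzero.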

In [335]:
from sklearn.linear_model import ElasticNetCV

elastic_cv = ElasticNetCV(
    l1_ratio=np.linspace(0.01, 0.99, 20),
    cv=10,
    max_iter=5000,
    random_state=SEED
)
elastic_cv.fit(X_train_full, y_train)

elastic_alpha = elastic_cv.alpha_
elastic_l1 = elastic_cv.l1_ratio_

results_elastic = evaluate_and_graph_reg(elastic_cv, X_train_full, y_train, X_test_full, y_test, f"ElasticNet (alpha={elastic_alpha:.4f}, l1_ratio={elastic_l1:.2f})", True)
results_reg.append(results_elastic)
--- ElasticNet (alpha=0.0007, l1_ratio=0.99) ---
Train RMSE: 0.6646 | Train R²: 0.5660
Test  RMSE: 0.6444 | Test  R²: 0.5666
✅ Model seems reasonably balanced
------------------------------
[Figure: regression diagnostic plots]

The ElasticNet regression model demonstrates a strong linear relationship between the predictors and the true values, achieving a Test $R^2$ of $0.567$ (the model accounts for approximately $56.7\%$ of the variance in the trait). It is stable with low error ($\text{Test RMSE} \approx 0.644$) and shows no notable bias or non-linearity: the Residuals vs Fitted plot is randomly scattered around zero, the residual distribution is approximately normal, and the QQ plot closely follows the theoretical line. The combined L1/L2 penalty mitigates multicollinearity and can perform feature selection simultaneously; with the selected $l1\_ratio = 0.99$, the fitted model is nearly pure Lasso, which explains its essentially identical metrics.

In [336]:
shrinkage_df = pd.DataFrame([
    results_ridge,
    results_lasso,
    results_elastic
])
shrinkage_df
Out[336]:
model rmse_train rmse_test r2_train r2_test
0 Ridge (alpha=4.6059) 0.664629 0.644411 0.565998 0.566574
1 Lasso (alpha=0.0007) 0.664632 0.644428 0.565996 0.566551
2 ElasticNet (alpha=0.0007, l1_ratio=0.99) 0.664632 0.644428 0.565996 0.566551

This table summarizes the performance of the three regularized linear regression models (Ridge, Lasso, and ElasticNet), all of which achieved nearly identical predictive power with a Test $R^2$ of approximately $0.566$. This consistency, combined with the closely matched Train and Test RMSE values ($\approx 0.665$ and $\approx 0.644$), indicates that regularization kept all three models stable and free of overfitting.

In [337]:
coef_df = pd.DataFrame({
    "OLS": lr_full.coef_,
    "Ridge": ridge_cv.coef_,
    "Lasso": lasso_cv.coef_,
    "ElasticNet": elastic_cv.coef_
}, index=X_train_full.columns)
coef_df
Out[337]:
OLS Ridge Lasso ElasticNet
age 0.009250 0.009252 0.008608 0.008608
env_index 0.345260 0.345044 0.344601 0.344599
polygenic_score 0.657166 0.656744 0.656540 0.656535
sex 0.188571 0.188083 0.185910 0.185905

Summary of Findings¶

  • Model Performance: All four models—OLS, Ridge, Lasso, and ElasticNet—achieved essentially identical, high performance metrics, with a Test $R^2$ of approximately $0.566$ and a low, consistent Test RMSE of about $0.644$.
  • Feature Importance (Coefficients): Across all models, the polygenic_score has the largest coefficient (around $0.657$), indicating it is the strongest predictor of the target variable, followed by the env_index (around $0.345$). Sex and age have comparatively smaller coefficients (around $0.188$ and $0.009$, respectively).

Key Observation¶

The high similarity in performance and the near-identical coefficients across all models suggest that regularization (Ridge, Lasso, ElasticNet) did not significantly alter the coefficient magnitudes compared to the simple OLS model. This implies that the original OLS model was already well-conditioned, with little need for penalization, and had minimal signs of overfitting or multicollinearity.
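This observation can be reproduced in miniature: with unit-variance, near-orthogonal predictors, ridge shrinks each OLS coefficient by roughly $n/(n+\alpha)$, so a moderate penalty is almost a no-op. A sketch on synthetic data mimicking the four-feature design (the coefficients 0.66, 0.35, 0.19, 0.01 echo the table above; everything else is invented):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(3)
n = 1000
X = rng.normal(size=(n, 4))  # well-conditioned: roughly orthogonal columns
y = X @ np.array([0.66, 0.35, 0.19, 0.01]) + rng.normal(scale=0.66, size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=5.0).fit(X, y)

# With near-orthogonal unit-variance predictors, ridge scales each
# coefficient by roughly n / (n + alpha), so alpha=5 barely moves them.
diff = float(np.max(np.abs(ols.coef_ - ridge.coef_)))
print(round(diff, 4))
```

The maximum coefficient change is on the order of $\alpha/n$ of the coefficient size, matching the near-identical OLS/Ridge columns in the coefficient table.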

In [338]:
ax = coef_df.plot(kind="bar", figsize=(10, 6))  # pass figsize here; a prior plt.figure() call would be left as an unused empty figure
ax.set_title("Coefficient Estimates Across Methods")
ax.set_ylabel("Coefficient Value")
ax.tick_params(axis="x", rotation=45)
plt.show()
[Figure: coefficient estimates across methods]

The results show that the performance metrics across all four models (OLS, Ridge, Lasso, and ElasticNet) are virtually identical and consistently strong, with a Test $R^2$ of approximately $0.566$ and a low Test RMSE of about $0.644$. The coefficients for the predictor variables are also nearly indistinguishable across the models, with polygenic_score the strongest predictor (coefficient $\approx 0.657$) and age the weakest ($\approx 0.009$). This consistency suggests the original OLS model was already well-conditioned, and adding L1/L2 regularization did not meaningfully alter the feature coefficients or overall predictive power.

Regression Tree¶

In [339]:
from sklearn.tree import DecisionTreeRegressor

# 1. Baseline: PRS only
reg_baseline = DecisionTreeRegressor(random_state=SEED, max_depth=10)
results_reg.append(
    evaluate_and_graph_reg(reg_baseline, X_train_baseline, y_train, X_test_baseline, y_test, "Regression Tree (PRS only)", True)
)

# 2. Full model
reg_full = DecisionTreeRegressor(random_state=SEED, max_depth=10)
results_reg.append(
    evaluate_and_graph_reg(reg_full, X_train_full, y_train, X_test_full, y_test, "Regression Tree (PRS + covariates)", True)
)

# 3. Full + PCA
reg_full_pca = DecisionTreeRegressor(random_state=SEED, max_depth=10)
results_reg.append(
    evaluate_and_graph_reg(reg_full_pca, X_train_full_pca, y_train, X_test_full_pca, y_test, "Regression Tree + PCA", True)
)
--- Regression Tree (PRS only) ---
Train RMSE: 0.6923 | Train R²: 0.5291
Test  RMSE: 0.7863 | Test  R²: 0.3546
⚠️ Warning: Signs of Overfitting (Train R² much higher than Test R²)
------------------------------
[Figure: regression diagnostic plots]
--- Regression Tree (PRS + covariates) ---
Train RMSE: 0.5490 | Train R²: 0.7039
Test  RMSE: 0.7650 | Test  R²: 0.3892
⚠️ Warning: Signs of Overfitting (Train R² much higher than Test R²)
------------------------------
[Figure: regression diagnostic plots]
--- Regression Tree + PCA ---
Train RMSE: 0.5392 | Train R²: 0.7143
Test  RMSE: 0.7646 | Test  R²: 0.3899
⚠️ Warning: Signs of Overfitting (Train R² much higher than Test R²)
------------------------------
[Figure: regression diagnostic plots]

The Regression Tree ($\text{PRS only}$) model shows a weak-to-moderate relationship between the polygenic score and the true values, achieving a Test $R^2$ of $0.355$. This is well below the $R^2 \approx 0.436$ achieved by the simple linear regression on the same single predictor, so the linear model is the better fit. The output explicitly flags overfitting: the Train $R^2$ ($0.529$) is much higher than the Test $R^2$ ($0.355$), meaning the tree memorized training-specific patterns and does not generalize. The residuals also show clear non-normality (visible in the residual distribution and the QQ plot curving away from the line), reflecting the tree's piecewise-constant predictions.

The Regression Tree ($\text{PRS + covariates}$) model achieves a Test $R^2$ of $0.389$, again well below the $R^2 \approx 0.567$ of the full linear models, indicating the overall relationship is best modeled linearly. Overfitting is again flagged: the Train $R^2$ ($0.704$) far exceeds the Test $R^2$ ($0.389$). The residual diagnostics show the same non-normality, further demonstrating the tree's unsuitability relative to the linear models.

The Regression Tree ($\text{+ PCA}$) model performs almost identically (Test $R^2$ of $0.390$, versus $R^2 \approx 0.567$ for the linear models) and overfits just as badly: the Train $R^2$ ($0.714$) far exceeds the Test $R^2$ ($0.390$). The residual diagnostics again confirm the model's structural limitations for this dataset.

All three tree models overfit the data and should not be used as-is; their hyperparameters (depth, pruning) would need to be tuned and the models re-evaluated before they could be considered.
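One way to address the overfitting would be to cross-validate the tree's complexity (depth and cost-complexity pruning) rather than fixing max_depth=10. A hedged sketch on synthetic data with a similar linear signal (this is not the notebook's X_train_full; the parameter grid is illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(4)
n = 1000
X = rng.normal(size=(n, 4))
y = X @ np.array([0.66, 0.35, 0.19, 0.01]) + rng.normal(scale=0.66, size=n)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Cross-validate depth and cost-complexity pruning instead of fixing max_depth=10.
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [2, 3, 4, 5, 6], "ccp_alpha": [0.0, 0.001, 0.01]},
    cv=5,
)
grid.fit(X_tr, y_tr)
test_r2 = grid.best_estimator_.score(X_te, y_te)
print(grid.best_params_, round(test_r2, 3))
```

Tuning narrows the train/test gap, but a pruned tree would still be expected to trail the linear models here, since the underlying signal is linear and a tree can only approximate it with axis-aligned steps.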

Regression Results¶

In [340]:
df_reg = pd.DataFrame(results_reg)
pick_best_model_reg(df_reg, overfit_threshold=0.05)
Total Models: 16
Valid Models: 13
Disqualified Models: 3

⚠️ Disqualified for overfitting:
model r2_test rmse_test r2_train overfitting_gap
13 Regression Tree (PRS only) 0.354637 0.786335 0.529077 0.174440
14 Regression Tree (PRS + covariates) 0.389185 0.764998 0.703871 0.314686
15 Regression Tree + PCA 0.389889 0.764557 0.714300 0.324412
Best by R² (Test Set):
model rmse_train rmse_test r2_train r2_test overfitting_gap is_overfit
2 LR Full + PCA 0.664538 0.644339 0.566118 0.566671 -0.000552 False
Best by RMSE (Test Set):
model rmse_train rmse_test r2_train r2_test overfitting_gap is_overfit
2 LR Full + PCA 0.664538 0.644339 0.566118 0.566671 -0.000552 False
Final ranking (better at top):
model r2_test rmse_test overfitting_gap
2 LR Full + PCA 0.566671 0.644339 -0.000552
10 Ridge (alpha=4.6059) 0.566574 0.644411 -0.000576
1 LR Full (PRS + covariates) 0.566574 0.644411 -0.000575
3 RFE Subset Selection (Full, 4 feats) 0.566574 0.644411 -0.000575
11 Lasso (alpha=0.0007) 0.566551 0.644428 -0.000555
12 ElasticNet (alpha=0.0007, l1_ratio=0.99) 0.566551 0.644428 -0.000555
6 Forward AIC Full: ['polygenic_score', 'env_ind... 0.566274 0.644634 -0.000359
8 Forward BIC Full PCA: ['polygenic_score', 'env... 0.566274 0.644634 -0.000359
7 Backward AIC Full: ['env_index', 'polygenic_sc... 0.566274 0.644634 -0.000359
9 Backward BIC Full PCA: ['env_index', 'polygeni... 0.566274 0.644634 -0.000359
4 RFE Subset (Full + PCA, selected 4 components) 0.470886 0.712000 0.005638
0 LR Baseline (PRS only) 0.436370 0.734856 0.004531
5 Forward AIC Baseline: ['polygenic_score'] 0.436370 0.734856 0.004531
🏆 Best regression model: LR Full + PCA
Out[340]:
'LR Full + PCA'
  1. Best Performers (Top Tier: $R^2 \approx 0.567$). The highest predictive accuracy was achieved by a large group of models that leveraged the full feature set. These models cluster around a Test $R^2$ of $0.567$ and a Test RMSE of $0.644$, meaning about $56.7\%$ of the variance in the quantitative trait is explained by the full set of risk factors.
  • Best model (by a small margin): LR Full + PCA (linear regression with full features and principal components), which took the top spot primarily due to its stability (a slightly negative overfitting gap).
  • Highly competitive models (statistically identical performance):
    • Ridge, Lasso, and ElasticNet: these regularized models achieved the same top $R^2$, confirming that handling the multicollinearity in the original features did not cost predictive power.
    • LR Full (PRS + covariates) and LR Full + PCA: including raw covariates or principal components gave virtually the same performance, suggesting the PCA components captured much the same variance as the raw features.
    • RFE and AIC/BIC subset selection: the various selection methods (RFE, forward/backward AIC/BIC) converged on models with essentially the same performance, identifying the same optimal predictor subset.
  2. Baseline and Sub-Optimal Models (Lower Tier: $R^2 < 0.50$)
  • Baseline LR models ($R^2 \approx 0.436$): the linear models using only the $\text{polygenic\_score}$ performed markedly worse, indicating that the environmental index and sex are essential for reaching the full $56.7\%$ explanatory power.
  • Regression trees (disqualified): the three regression tree models were the worst performers (Test $R^2$ between $0.35$ and $0.39$) and were disqualified for severe overfitting. This confirms that the relationship between predictors and the quantitative trait is fundamentally linear, and high-variance tree-based models are a poor structural fit for this data.

The regression analysis validates the use of simple, stable, linear models for predicting the quantitative trait. The top predictive power of $R^2 \approx 0.567$ is achieved by models leveraging the full set of features, with LR Full + PCA ranked best owing to its stability and its use of uncorrelated components.